Markdown-first | Yes | PDFs source-only
OCR | Off | config.yaml
Downstream PDF analysis | No | Markdown-derived
AI/LLM API used | No | rule/heuristic + human review

RFQX Pipeline Architecture

How the RFQX repository works: what each script does, which steps are deterministic, heuristic, or AI-assisted, what needs human review, and what happens when new customer documents are added.

Architecture & Script Responsibilities


1. What RFQX Does

RFQX turns a customer RFQ document set into a supplier-side working environment: customer requirement review, supplier proposals, open points, estimation impact, an initial cybersecurity concept, an initial system/security design, and derived supplier system requirements — all with traceability back to Markdown-derived sources.

2. Input Folders

| Folder | Purpose |
| --- | --- |
| customer-input/pdf/ | Source customer PDFs (source-only; never analysed downstream). |
| customer-input/markdown-manual/ | Manually corrected Markdown. |
| customer-input/customer-feedback/ | Customer feedback CSVs (Phase 4 loop). |
| converted/markdown*/ | Markdown produced from the PDFs. |

3. Markdown-First Rule

All analysis runs on Markdown, never on the PDFs directly. PDFs are converted to Markdown first; every requirement keeps its source Markdown file, section path and page reference. This keeps extraction auditable and stable.
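
As an illustration of the traceability described above, a requirement record could carry its Markdown source alongside the text. This is a minimal sketch, not the repository's actual data model: the class names, field names, and the choice of an integer page reference are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequirementSource:
    """Where a requirement came from (all field names are hypothetical)."""
    markdown_file: str   # source Markdown file, never the PDF itself
    section_path: str    # heading path inside the Markdown file
    page_reference: int  # page in the original PDF, kept for audit only


@dataclass(frozen=True)
class Requirement:
    req_id: str
    text: str
    source: RequirementSource


def is_traceable(req: Requirement) -> bool:
    """A requirement counts as traceable only if it points at a Markdown source."""
    src = req.source
    return (
        src.markdown_file.endswith(".md")
        and bool(src.section_path.strip())
        and src.page_reference >= 1
    )
```

A check like `is_traceable` would reject any record whose source points at a PDF, which is one way to enforce the rule mechanically rather than by convention.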

4. OCR Policy

OCR is disabled (ocr_enabled: false in config.yaml). Pages with no extractable text are flagged for manual review rather than OCR'd. The built-in converter never performs OCR.
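
The flag-instead-of-OCR behaviour can be sketched as follows. The function name, the `min_chars` threshold, and the assumption that the converter exposes one text string per page are all hypothetical; the real converter may detect empty pages differently.

```python
def pages_needing_manual_review(page_texts: list[str], min_chars: int = 1) -> list[int]:
    """Return 1-based page numbers whose extracted text is effectively empty.

    No OCR is attempted: such pages are only reported for manual review.
    """
    flagged = []
    for page_number, text in enumerate(page_texts, start=1):
        if len(text.strip()) < min_chars:
            flagged.append(page_number)
    return flagged
```

For example, `pages_needing_manual_review(["Intro", "", "  \n", "Table"])` returns `[2, 3]`.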

5. Main Scripts and Their Responsibility

See the Script Responsibility Table below for the full table.

6. Generated Outputs

Requirements register, supplier proposals, open points, estimation impact, supplier system requirements + coverage, initial cybersecurity concept, system/security design, workflow state, the HTML site, and the evidence package.

7. What Happens When New Documents Are Added

See section 13 (re-run behaviour). In short: convert → quality-check → extract → regenerate register, proposals, open points, SSRs, concept → regenerate site → validate. Customer decisions and feedback are never invented.

Per-PDF Document Intelligence Re-run Behaviour

When a new PDF is added and the pipeline is re-run:

Per-PDF Content / Section Filtering

8. What Is Deterministic / Rule-Based

These produce identical output for identical input.
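
One way to verify the "identical output for identical input" property is to run a step several times and compare digests of its output. The sketch below assumes the step can be called as a Python function that returns JSON-serialisable data; `output_digest` and `is_deterministic` are hypothetical helper names, not part of the repository.

```python
import hashlib
import json
from typing import Any, Callable


def output_digest(data: Any) -> str:
    """SHA-256 of a canonical JSON encoding (sorted keys, fixed separators)."""
    canonical = json.dumps(data, sort_keys=True, separators=(",", ":"), ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def is_deterministic(step: Callable[[Any], Any], sample_input: Any, runs: int = 3) -> bool:
    """Run `step` repeatedly on the same input and compare the output digests."""
    digests = {output_digest(step(sample_input)) for _ in range(runs)}
    return len(digests) == 1
```

Sorting keys before hashing means two outputs that differ only in dictionary ordering are treated as identical.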

9. What Is Heuristic

Heuristics are conservative and explainable, but they approximate engineering judgment and should be reviewed.

10. What Requires AI/LLM Judgment

Currently no script calls an AI/LLM API (see analysis/ai_assistance_gap_report.md). The outputs that would most benefit from AI assistance — nuanced requirement interpretation, bespoke supplier proposals, sharp clarification questions, and concept conclusions — are today produced by rules/heuristics or human review. An optional, controlled AI stage is designed in docs/ai_analysis_stage_design.md.

11. What Requires Human Review

12. Known Limitations

13. How to Re-run the Pipeline

python scripts/run_full_analysis.py                       # convert + extract (when new PDFs)
python scripts/generate_requirement_review_register.py    # register, proposals, open points, concept, SSRs
python scripts/generate_supplier_system_requirements.py   # SSR derivation + coverage + matrix
python scripts/generate_document_intelligence.py          # per-PDF records, traceability and impact data
python scripts/ingest_customer_feedback.py                # apply any customer feedback
python scripts/generate_expert_synthesis.py               # features/interfaces/capabilities/diagrams
python scripts/generate_html_site.py                      # build the site
python scripts/check_html_site.py                         # validate
python scripts/export_site_evidence.py                    # evidence package
python scripts/check_git_status.py                        # git status report
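
The same sequence can be driven from a single Python entry point. This is a sketch only: `PIPELINE_STEPS`, `run_script`, and `run_pipeline` are hypothetical names, and it assumes each script signals failure with a non-zero exit code, which the repository may or may not guarantee.

```python
import subprocess
import sys
from typing import Callable, Sequence

# Script order copied from the command list above.
PIPELINE_STEPS = [
    "scripts/run_full_analysis.py",
    "scripts/generate_requirement_review_register.py",
    "scripts/generate_supplier_system_requirements.py",
    "scripts/generate_document_intelligence.py",
    "scripts/ingest_customer_feedback.py",
    "scripts/generate_expert_synthesis.py",
    "scripts/generate_html_site.py",
    "scripts/check_html_site.py",
    "scripts/export_site_evidence.py",
    "scripts/check_git_status.py",
]


def run_script(script: str) -> int:
    """Run one script with the current interpreter and return its exit code."""
    return subprocess.run([sys.executable, script]).returncode


def run_pipeline(
    steps: Sequence[str] = PIPELINE_STEPS,
    runner: Callable[[str], int] = run_script,
) -> list[str]:
    """Run steps in order and stop at the first non-zero exit code.

    Returns the steps that completed successfully.
    """
    completed = []
    for script in steps:
        if runner(script) != 0:
            break
        completed.append(script)
    return completed
```

Passing the runner in as a parameter keeps the ordering logic testable without executing the real scripts.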

14. How to Validate the Result

Running python scripts/check_html_site.py must print "HTML site validation passed." The check fails on missing pages or artifacts, OCR being enabled, downstream PDF analysis, weak labels, insufficient SSR coverage, or unmapped active requirements without a disposition. Warnings (heuristic proposals, no AI stage, blocked SSRs, feedback not ingested) are expected and are listed.
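
In a CI job or wrapper script, the pass condition could be checked programmatically. The sketch below assumes the pass message is written to standard output and that a failed check also returns a non-zero exit code; both are assumptions, and the helper names are hypothetical.

```python
import subprocess
import sys

PASS_MARKER = "HTML site validation passed."


def validation_passed(stdout: str, returncode: int = 0) -> bool:
    """True only if the checker exited cleanly and printed the pass marker."""
    return returncode == 0 and PASS_MARKER in stdout


def run_check(script: str = "scripts/check_html_site.py") -> bool:
    """Run the checker and evaluate its captured output."""
    result = subprocess.run([sys.executable, script], capture_output=True, text=True)
    return validation_passed(result.stdout, result.returncode)
```

Because warnings are expected, the check looks for the pass marker anywhere in the output rather than requiring the output to contain nothing else.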

15. AI Assistance Gap Report

The current pipeline uses no AI/LLM API. See analysis/ai_assistance_gap_report.md for the scan result, the current rule/heuristic outputs, and the outputs that would benefit from controlled AI assistance.

16. Optional AI Stage Design

The optional AI stage is design-only. See docs/ai_analysis_stage_design.md, prompts/, and schemas/. It reads Markdown-derived requirements only, never PDFs directly, returns schema-validated JSON, separates evidence from inference, marks confidence, and stays opt-in.

Script Responsibility Table

| Script | Input | Output | Deterministic / Heuristic / AI-Assisted | Human Review Needed |
| --- | --- | --- | --- | --- |
| convert_pdf_to_markdown.py | customer PDFs | Markdown | Deterministic (no OCR) | Yes (partial pages) |
| check_converter_availability.py | environment | converter report | Deterministic | No |
| check_markdown_quality.py | Markdown | quality report | Deterministic + heuristic thresholds | Yes (low quality) |
| ingest_markdown.py | Markdown | chunks | Deterministic | No |
| extract_requirements.py | Markdown | extracted_requirements.json | Heuristic (pattern-based) | Yes |
| generate_requirement_review_register.py | extracted reqs + mappings | review register, proposals, open points, estimation, concept, design, workflow state | Heuristic defaults + deterministic assembly | Yes (positions/proposals) |
| generate_supplier_system_requirements.py | review register + models + open points | SSRs, customer→supplier matrix, coverage | Heuristic classification + deterministic clustering/maths | Yes (SSR statements) |
| generate_document_intelligence.py | manifest + review register + SSRs + open points + estimation | per-PDF document intelligence records, traceability matrix, diagrams, and page data | Deterministic + heuristic classification/scoring | Yes (document conclusions) |
| ingest_customer_feedback.py | feedback CSVs | status deltas, updated register | Deterministic | Yes (decisions are customer-owned) |
| generate_expert_synthesis.py | Markdown-derived data | features/interfaces/capabilities/diagrams | Heuristic | Yes |
| generate_html_site.py | all generated data | HTML site | Deterministic | No |
| check_html_site.py | site + data | validation result | Deterministic | No |
| export_site_evidence.py | site | screenshots + zip | Deterministic (Playwright optional) | No |
| check_git_status.py | git repo | git status report | Deterministic | No |

None of the above currently call an AI/LLM API.

AI Assistance Gap

AI / LLM Assistance Gap Report

Current pipeline uses AI/LLM API: No

A full scan of scripts/ finds no call to any AI/LLM API (no Anthropic/OpenAI SDK, no requests/HTTP call to a model endpoint, no local model invocation). All analysis is produced by deterministic code and conservative heuristics.
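
A scan of this kind could look roughly like the sketch below. It is not the scan that produced the report: the pattern list is illustrative and deliberately narrow (SDK imports and well-known API hostnames), and `find_ai_usage` is a hypothetical helper name.

```python
import re
from pathlib import Path

# Illustrative patterns only; a real scan may use a broader or different list.
AI_USAGE_PATTERNS = [
    re.compile(r"^\s*(import|from)\s+(anthropic|openai)\b", re.MULTILINE),
    re.compile(r"api\.(openai|anthropic)\.com"),
]


def find_ai_usage(scripts_dir: str) -> dict[str, list[str]]:
    """Map each Python file under `scripts_dir` to the patterns it matches."""
    hits: dict[str, list[str]] = {}
    for path in sorted(Path(scripts_dir).rglob("*.py")):
        source = path.read_text(encoding="utf-8", errors="replace")
        matched = [p.pattern for p in AI_USAGE_PATTERNS if p.search(source)]
        if matched:
            hits[str(path)] = matched
    return hits
```

An empty result from `find_ai_usage("scripts")` would be consistent with the finding above, although a pattern scan cannot by itself prove that no model endpoint is reachable through indirect calls.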

Which outputs are rule-based / heuristic today

| Output | Method |
| --- | --- |
| Requirement extraction | Heuristic (pattern/keyword) |
| Supplier position defaults | Heuristic rules |
| Engineering expectation / supplier proposal text | Templated heuristics |
| Security relevance + capability/feature/interface mapping | Heuristic keyword rules |
| Open points | Curated topic matching |
| Estimation impact levels | Heuristic rules |
| Supplier system requirement derivation | Heuristic classification + deterministic clustering |
| Cybersecurity concept conclusions | Templated from aggregated data |

Which outputs should be AI-assisted for better quality

  • Requirement interpretation (intent, ambiguity, implicit obligations).
  • Supplier proposal generation (bespoke, requirement-specific wording).
  • Customer clarification question generation (sharper, context-aware).
  • Supplier system requirement derivation (better clustering + statements).
  • Cybersecurity concept conclusion generation (assumptions, risks, gaps).

What structured prompts / schema are needed

Defined in docs/ai_analysis_stage_design.md, with prompts in prompts/ and JSON schemas in schemas/. The AI stage must return structured JSON validated against those schemas — never free text.

Recommendation

AI/LLM-assisted analysis should be added as an optional, controlled stage for: requirement interpretation, supplier proposal generation, customer clarification generation, system requirement derivation, and cybersecurity concept conclusions. It must read Markdown-derived requirements only, never analyse PDFs directly, return schema-validated JSON, separate explicit evidence from inference, and mark confidence. Until implemented, conclusions are only as good as the current rules/heuristics plus human review.

Optional AI Stage Design

Optional AI Analysis Stage — Design

Design only. No API calls are implemented. This defines an optional, controlled AI/LLM stage that augments the rule/heuristic pipeline without replacing its deterministic, auditable core.

Position in the Pipeline

Markdown-derived requirements
        │
        ▼
  (optional) AI Analysis Stage  ──►  structured JSON (schema-validated)
        │                               │
        ▼                               ▼
  merge into review register / SSRs / concept  ──►  site + validation

The AI stage runs after requirement extraction and before site generation. Its JSON output is merged as a suggestion layer; deterministic assembly, traceability and validation are unchanged. If the AI stage is disabled, the rule/heuristic defaults are used (current behaviour).

Hard Rules

  • Read Markdown-derived requirements only; never analyse PDFs directly.
  • Return structured JSON validated against the schemas in schemas/; never free text.
  • Separate explicit evidence from inference (every field tags evidence vs inferred).
  • Mark confidence (high / medium / low) on every judgment.
  • Preserve source traceability (source_markdown, page_reference) on every item.
  • Output is a suggestion: a human reviews before it becomes a committed position.
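
The hard rules above lend themselves to a mechanical pre-check on each suggestion item before any human review. The sketch below uses only the standard library rather than a JSON Schema validator. The `source_markdown` and `page_reference` fields come from the rules above; the `confidence` and `basis` field names are assumptions made for the example, and the real schemas in schemas/ may name them differently.

```python
ALLOWED_CONFIDENCE = {"high", "medium", "low"}
ALLOWED_BASIS = {"evidence", "inferred"}  # hypothetical tag values
REQUIRED_TRACE_FIELDS = ("source_markdown", "page_reference")


def hard_rule_violations(item: dict) -> list[str]:
    """Return the rule violations for one AI suggestion (empty list = OK)."""
    problems = []
    for field in REQUIRED_TRACE_FIELDS:
        if item.get(field) in (None, ""):
            problems.append(f"missing traceability field: {field}")
    if str(item.get("source_markdown", "")).lower().endswith(".pdf"):
        problems.append("source must be Markdown, not a PDF")
    if item.get("confidence") not in ALLOWED_CONFIDENCE:
        problems.append("confidence must be high, medium or low")
    if item.get("basis") not in ALLOWED_BASIS:
        problems.append("basis must be 'evidence' or 'inferred'")
    return problems
```

Returning a list of problems instead of raising on the first one lets a reviewer see every violation for an item at once.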

Agents and Outputs

| Agent | Prompt | Schema | Produces |
| --- | --- | --- | --- |
| Requirement Review | prompts/requirement_review_agent.md | schemas/requirement_review.schema.json | supplier proposal + position + open points per requirement |
| System Requirement Derivation | prompts/system_requirement_derivation_agent.md | schemas/system_requirement_derivation.schema.json | SSR candidates with many-to-many traceability |
| Cybersecurity Concept | prompts/cybersecurity_concept_agent.md | schemas/cybersecurity_concept.schema.json | concept conclusions, assumptions, risks, gaps |

Controls

  • Batching: process requirements in capped batches; deterministic IDs.
  • Determinism: temperature 0; cache by input hash so re-runs are stable.
  • Validation: reject any non-conforming JSON; fall back to heuristic defaults.
  • Cost/scope guard: AI stage is opt-in via config flag (default off).
  • No silent overwrite: AI output lands in a *_ai_suggested field; merge is explicit.
  • Supplier system requirement control: generate supplier system requirement candidates only from Markdown-derived customer requirements, with explicit evidence, inference, confidence, and open points.
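
Several of these controls can be combined in one small wrapper: the opt-in flag, caching by input hash, fallback to the heuristic default, and writing to a separate `*_ai_suggested` field. Since the stage is design-only, this is a sketch under stated assumptions: `suggest`, `input_hash`, `ai_call`, and the `proposal` field are hypothetical, and the model call is injected as a plain function so that no real API is implied.

```python
import hashlib
import json
from typing import Callable, Optional

_cache: dict[str, dict] = {}  # in-memory stand-in for a persistent cache


def input_hash(requirement: dict) -> str:
    """Stable hash of the requirement, used as the cache key."""
    canonical = json.dumps(requirement, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def has_valid_proposal(output: object) -> bool:
    """Minimal stand-in for schema validation."""
    return isinstance(output, dict) and isinstance(output.get("proposal"), str)


def suggest(
    requirement: dict,
    heuristic_proposal: str,
    ai_enabled: bool = False,  # opt-in, default off
    ai_call: Optional[Callable[[dict], dict]] = None,
) -> dict:
    """Return the heuristic record, plus an AI suggestion field when valid."""
    record = {**requirement, "proposal": heuristic_proposal}
    if not ai_enabled or ai_call is None:
        return record
    key = input_hash(requirement)
    if key not in _cache:
        _cache[key] = ai_call(requirement)  # called once per distinct input
    output = _cache[key]
    if has_valid_proposal(output):
        # Never overwrite the heuristic value; store the suggestion beside it.
        record["proposal_ai_suggested"] = output["proposal"]
    return record
```

With the flag off, or when the output fails validation, the record is exactly what the heuristic pipeline would have produced, which matches the fallback behaviour described above.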

Inputs / Outputs Summary

  • Input: requirements/requirement_review_register.json (or extracted requirements), relevant Markdown excerpts.
  • Output: JSON files under an ai_suggestions/ folder, schema-validated, then merged on human approval.
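
The "merged on human approval" step could be expressed as an explicit promotion of approved fields only. This is a sketch, assuming suggestions are stored beside the committed value under a `_ai_suggested` suffix as described in the Controls list; `merge_approved` and the example field names are hypothetical.

```python
SUFFIX = "_ai_suggested"


def merge_approved(record: dict, approved_fields: set[str]) -> dict:
    """Promote `<field>_ai_suggested` to `<field>` only for approved fields.

    The input record is left unchanged; a merged copy is returned.
    """
    merged = dict(record)
    for key, value in record.items():
        if key.endswith(SUFFIX):
            field = key[: -len(SUFFIX)]
            if field in approved_fields:
                merged[field] = value
    return merged
```

Fields that were not approved keep their rule/heuristic value, and the suggestion stays in place for later review.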