Pipeline Architecture - RFQX Security Architecture Package

Markdown-first: Yes (PDFs source-only)
OCR: Off (config.yaml)
Downstream PDF analysis: No (Markdown-derived)
AI/LLM API used: No (rule/heuristic + human review)

RFQX Pipeline Architecture

What each script does, what is deterministic vs heuristic vs AI-assisted, what needs human review, and what happens when new customer documents are added.

Architecture & Script Responsibilities

How the RFQX repository works, what each script does, what is deterministic vs heuristic vs AI-assisted, and what happens when input documents change.

1. What RFQX Does

RFQX turns a customer RFQ document set into a supplier-side working environment: customer requirement review, supplier proposals, open points, estimation impact, an initial cybersecurity concept, an initial system/security design, and derived supplier system requirements — all with traceability back to Markdown-derived sources.

2. Input Folders

Folder	Purpose
`customer-input/pdf/`	Source customer PDFs (source-only; never analysed downstream).
`customer-input/markdown-manual/`	Manually corrected Markdown.
`customer-input/customer-feedback/`	Customer feedback CSVs (Phase 4 loop).
`converted/markdown*/`	Markdown produced from the PDFs.

3. Markdown-First Rule

All analysis runs on Markdown, never on the PDFs directly. PDFs are converted to Markdown first; every requirement keeps its source Markdown file, section path and page reference. This keeps extraction auditable and stable.
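A minimal sketch of what such a traceable requirement record could look like. The class name, the ID format and the sample text are illustrative assumptions; only the idea of keeping the source Markdown file, section path and page reference on every requirement comes from the rule above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RequirementRecord:
    """One extracted requirement plus its traceability fields (hypothetical layout)."""
    requirement_id: str    # hypothetical ID format
    text: str
    source_markdown: str   # Markdown file the text was extracted from
    section_path: tuple    # e.g. ("3", "3.2") for section 3.2
    page_reference: int    # page number in the original PDF


def is_traceable(record):
    """True only if the record points at Markdown and carries section and page data."""
    return (
        record.source_markdown.endswith(".md")
        and len(record.section_path) > 0
        and record.page_reference >= 1
    )


# Invented sample data, for illustration only.
example = RequirementRecord(
    requirement_id="REQ-0001",
    text="Diagnostic sessions shall be authenticated.",
    source_markdown="converted/markdown/spec.md",
    section_path=("3", "3.2"),
    page_reference=14,
)
print(is_traceable(example))  # True
```

A record whose source points at a file under `customer-input/pdf/` would fail this check, which mirrors the rule that PDFs are never analysed directly.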

4. OCR Policy

OCR is disabled (ocr_enabled: false in config.yaml). Pages with no extractable text are flagged for manual review rather than OCR'd. The built-in converter never performs OCR.
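The policy can be sketched as follows, assuming the converter yields one text string per PDF page. The function name and the input shape are assumptions, not the actual converter interface.

```python
def pages_needing_manual_review(page_texts, ocr_enabled=False):
    """Return 1-based page numbers whose extracted text is empty.

    page_texts: list of strings, one per PDF page (assumed converter output).
    """
    if ocr_enabled:
        # The policy above requires ocr_enabled: false, so refuse to continue.
        raise ValueError("OCR must stay disabled (ocr_enabled: false)")
    return [
        number
        for number, text in enumerate(page_texts, start=1)
        if not text.strip()
    ]


flagged = pages_needing_manual_review(["Scope and terms", "", "   \n"])
print(flagged)  # [2, 3]
```

Pages that contain only whitespace are treated like empty pages here; whether the real converter does the same is not stated in this document.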

5. Main Scripts and Their Responsibility

See the Script Responsibility Table below for each script's input, output, method and review needs.

6. Generated Outputs

Requirements register, supplier proposals, open points, estimation impact, supplier system requirements + coverage, initial cybersecurity concept, system/security design, workflow state, the HTML site, and the evidence package.

7. What Happens When New Documents Are Added

See section 13 (re-run behaviour). In short: convert → quality-check → extract → regenerate register, proposals, open points, SSRs, concept → regenerate site → validate. Customer decisions and feedback are never invented.

Per-PDF Document Intelligence Re-run Behavior

When a new PDF is added and the pipeline is re-run:

a new document intelligence record is created
a new per-PDF page is created
requirements from that PDF are grouped under that page
SSRs derived from that PDF are linked
open points caused by that PDF are linked
design/concept/estimation impacts are recalculated
overall dashboards are also updated

Per-PDF Content / Section Filtering

Section filters are generated from converted Markdown, extracted requirement metadata, section references, and page references.
Parent section filters include child requirements because each requirement row stores the selected section plus all parent section IDs.
New PDFs automatically get section filters when numbered headings or table-of-contents-like entries are detected.
If headings are not reliable, the document page falls back to page-based filtering.
PDFs are not analyzed directly downstream; they remain source evidence only.
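The parent-section behaviour can be illustrated with a short sketch. It assumes dotted numeric section IDs such as "3.2.1" and a `section_ids` field on each row; both are assumptions made for the example.

```python
def section_with_parents(section_id):
    """Return the section ID plus all parent IDs, outermost first."""
    parts = section_id.split(".")
    return [".".join(parts[: i + 1]) for i in range(len(parts))]


def filter_by_section(rows, selected):
    """Keep rows whose stored section IDs include the selected section."""
    return [row for row in rows if selected in row["section_ids"]]


# Invented rows; "section_ids" is a hypothetical field name.
rows = [
    {"id": "REQ-1", "section_ids": section_with_parents("3.2.1")},
    {"id": "REQ-2", "section_ids": section_with_parents("3.4")},
    {"id": "REQ-3", "section_ids": section_with_parents("4.1")},
]
print(section_with_parents("3.2.1"))                         # ['3', '3.2', '3.2.1']
print([row["id"] for row in filter_by_section(rows, "3")])   # ['REQ-1', 'REQ-2']
```

Because the parent IDs are stored on each row, selecting a parent section is a plain membership test and needs no tree traversal at filter time.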

8. What Is Deterministic / Rule-Based

PDF→Markdown conversion and page/section tagging.
Requirement extraction structure and IDs.
Requirement review register field assembly.
Supplier system requirement clustering and coverage maths.
Disposition assignment, traceability matrices, counts, coverage %.
Site generation and validation.

These produce identical output for identical input.
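As an example of the deterministic side, coverage maths could look like the sketch below. The definition used here (share of active requirements mapped to at least one SSR) is an assumption; the document does not spell out the formula.

```python
def coverage_percent(active_ids, mapped_ids):
    """Share of active requirements that are mapped, as a percentage.

    Sets are used so that input order and repeated IDs cannot change the result.
    """
    active = set(active_ids)
    if not active:
        return 0.0
    covered = active & set(mapped_ids)
    return round(100.0 * len(covered) / len(active), 1)


print(coverage_percent(["R1", "R2", "R3", "R4"], ["R2", "R4", "R9"]))  # 50.0
```

The same inputs in any order give the same number, which is the property the deterministic steps rely on.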

9. What Is Heuristic

Supplier position defaults (Accept / Accept-with-Assumption / Partial / …) from keyword + mapping rules.

Security-relevance inference and capability/feature/interface mapping.
SSR category classification and clustering granularity.
Open-point topic matching and estimation impact levels.

Heuristics are conservative and explainable, but they approximate engineering judgment and should be reviewed.
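A keyword rule for supplier position defaults might be sketched like this. The rule table, the "Unclassified" fallback and the return shape are invented for illustration; only the position names Accept, Accept-with-Assumption and Partial come from the list above.

```python
# Hypothetical rule table: the first matching keyword wins.
POSITION_RULES = [
    ("penetration test", "Partial"),
    ("shall", "Accept"),
    ("should", "Accept-with-Assumption"),
]


def default_position(requirement_text):
    """Return (position, matched_keyword, needs_review) for one requirement.

    needs_review is always True because a heuristic default is only a proposal.
    """
    lowered = requirement_text.lower()
    for keyword, position in POSITION_RULES:
        if keyword in lowered:
            return position, keyword, True
    return "Unclassified", None, True  # hypothetical fallback label


print(default_position("The supplier shall perform a penetration test."))
# ('Partial', 'penetration test', True)
```

Returning the matched keyword alongside the position keeps the default explainable: a reviewer can see which rule fired before accepting or changing it.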

10. What Requires AI/LLM Judgment

Currently no script calls an AI/LLM API (see analysis/ai_assistance_gap_report.md). The outputs that would most benefit from AI assistance — nuanced requirement interpretation, bespoke supplier proposals, sharp clarification questions, and concept conclusions — are today produced by rules/heuristics or human review. An optional, controlled AI stage is designed in docs/ai_analysis_stage_design.md.

11. What Requires Human Review

Low-confidence / human-review-flagged requirements.
Every supplier position before it is sent to the customer.
Responsibility (CIA/RASIC) decisions and open-point answers.
SSR statements before customer alignment.
The cybersecurity concept review with the Cybersecurity Manager and leads.

12. Known Limitations

Heuristic defaults are not a substitute for engineering review.
No AI stage is implemented; conclusions are only as good as the rules/heuristics.
Duplicate customer requirement IDs in the source are flagged Duplicate/Merged.
Screenshots require Playwright (not installed); export records the blocker.
Git remote is not configured; commits are local only.
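The duplicate-ID limitation listed above can be illustrated with a small sketch. The `customer_id` and `status` field names and the "Active" default are assumptions; only the Duplicate/Merged label comes from the list.

```python
from collections import Counter


def flag_duplicate_ids(requirements):
    """Return copies of the rows, marking repeated customer IDs as Duplicate/Merged."""
    counts = Counter(row["customer_id"] for row in requirements)
    flagged = []
    for row in requirements:
        updated = dict(row)  # copy so the input rows stay untouched
        if counts[row["customer_id"]] > 1:
            updated["status"] = "Duplicate/Merged"
        else:
            updated.setdefault("status", "Active")  # hypothetical default
        flagged.append(updated)
    return flagged


rows = [{"customer_id": "C-1"}, {"customer_id": "C-2"}, {"customer_id": "C-1"}]
print([row["status"] for row in flag_duplicate_ids(rows)])
# ['Duplicate/Merged', 'Active', 'Duplicate/Merged']
```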

13. How to Re-run the Pipeline

python scripts/run_full_analysis.py                       # convert + extract (when new PDFs)
python scripts/generate_requirement_review_register.py    # register, proposals, open points, estimation, concept, design
python scripts/generate_supplier_system_requirements.py   # SSR derivation + coverage + matrix
python scripts/generate_document_intelligence.py          # per-PDF records, traceability and impact data
python scripts/ingest_customer_feedback.py                # apply any customer feedback
python scripts/generate_expert_synthesis.py               # features/interfaces/capabilities/diagrams
python scripts/generate_html_site.py                      # build the site
python scripts/check_html_site.py                         # validate
python scripts/export_site_evidence.py                    # evidence package
python scripts/check_git_status.py                        # git status report
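The command sequence above could also be driven from a single runner. This is a sketch and not an existing script in the repository: the function names are hypothetical, and the stop-on-first-failure behaviour is an assumption.

```python
import subprocess
import sys

# Script order copied from the command list above.
PIPELINE = [
    "run_full_analysis.py",
    "generate_requirement_review_register.py",
    "generate_supplier_system_requirements.py",
    "generate_document_intelligence.py",
    "ingest_customer_feedback.py",
    "generate_expert_synthesis.py",
    "generate_html_site.py",
    "check_html_site.py",
    "export_site_evidence.py",
    "check_git_status.py",
]


def build_commands(scripts_dir="scripts", python=sys.executable):
    """Build one argument list per pipeline script."""
    return [[python, f"{scripts_dir}/{name}"] for name in PIPELINE]


def run_pipeline(commands, runner=subprocess.run):
    """Run the commands in order; return the first failing command, or None."""
    for command in commands:
        result = runner(command)
        if result.returncode != 0:
            return command  # stop so later steps never run on stale data
    return None
```

Calling `run_pipeline(build_commands())` from the repository root would execute the steps in the documented order. The `runner` parameter exists only so the sequencing logic can be tested without launching the real scripts.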

14. How to Validate the Result

python scripts/check_html_site.py must print "HTML site validation passed." It fails on missing pages/artifacts, OCR enabled, downstream PDF analysis, weak labels, insufficient SSR coverage, or unmapped active requirements without a disposition. Warnings (heuristic proposals, no AI stage, blocked SSRs, feedback not ingested) are expected and listed.
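The split between failures and warnings can be sketched for two of the checks named above. All field names (`active`, `ssr_ids`, `disposition`, `ai_stage_enabled`) are assumptions; the real check_html_site.py covers more conditions and may be structured differently.

```python
def validate(config, requirements):
    """Return (failures, warnings) for a small subset of the documented checks."""
    failures, warnings = [], []
    if config.get("ocr_enabled", False):
        failures.append("OCR enabled")
    for row in requirements:
        unmapped = not row.get("ssr_ids")
        if row.get("active") and unmapped and not row.get("disposition"):
            failures.append(
                f"unmapped active requirement without disposition: {row['id']}"
            )
    if not config.get("ai_stage_enabled", False):
        warnings.append("no AI stage")  # expected warning, not a failure
    return failures, warnings


failures, warnings = validate(
    {"ocr_enabled": False},
    [{"id": "R1", "active": True, "ssr_ids": ["SSR-1"]}],
)
if not failures:
    print("HTML site validation passed.")
```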

15. AI Assistance Gap Report

The current pipeline does not call any AI/LLM API. See analysis/ai_assistance_gap_report.md for the scan result, the current rule/heuristic outputs, and the outputs that would benefit from controlled AI assistance.

16. Optional AI Stage Design

The optional AI stage is design-only. See docs/ai_analysis_stage_design.md, prompts/, and schemas/. It reads Markdown-derived requirements only, never PDFs directly, returns schema-validated JSON, separates evidence from inference, marks confidence, and stays opt-in.

Script Responsibility Table

Script	Input	Output	Deterministic / Heuristic / AI-Assisted	Human Review Needed
convert_pdf_to_markdown.py	customer PDFs	Markdown	Deterministic (no OCR)	Yes (partial pages)
check_converter_availability.py	environment	converter report	Deterministic	No
check_markdown_quality.py	Markdown	quality report	Deterministic + heuristic thresholds	Yes (low quality)
ingest_markdown.py	Markdown	chunks	Deterministic	No
extract_requirements.py	Markdown	extracted_requirements.json	Heuristic (pattern-based)	Yes
generate_requirement_review_register.py	extracted reqs + mappings	review register, proposals, open points, estimation, concept, design, workflow state	Heuristic defaults + deterministic assembly	Yes (positions/proposals)
generate_supplier_system_requirements.py	review register + models + open points	SSRs, customer→supplier matrix, coverage	Heuristic classification + deterministic clustering/maths	Yes (SSR statements)
generate_document_intelligence.py	manifest + review register + SSRs + open points + estimation	per-PDF document intelligence records, traceability matrix, diagrams, and page data	Deterministic + heuristic classification/scoring	Yes (document conclusions)
ingest_customer_feedback.py	feedback CSVs	status deltas, updated register	Deterministic	Yes (decisions are customer-owned)
generate_expert_synthesis.py	Markdown-derived data	features/interfaces/capabilities/diagrams	Heuristic	Yes
generate_html_site.py	all generated data	HTML site	Deterministic	No
check_html_site.py	site + data	validation result	Deterministic	No
export_site_evidence.py	site	screenshots + zip	Deterministic (Playwright optional)	No
check_git_status.py	git repo	git status report	Deterministic	No

None of the above currently call an AI/LLM API.

AI Assistance Gap

AI / LLM Assistance Gap Report

Current pipeline uses AI/LLM API: No

A full scan of scripts/ finds no call to any AI/LLM API (no Anthropic/OpenAI SDK, no requests/HTTP call to a model endpoint, no local model invocation). All analysis is produced by deterministic code and conservative heuristics.

Which outputs are rule-based / heuristic today

Output	Method
Requirement extraction	Heuristic (pattern/keyword)
Supplier position defaults	Heuristic rules
Engineering expectation / supplier proposal text	Templated heuristics
Security relevance + capability/feature/interface mapping	Heuristic keyword rules
Open points	Curated topic matching
Estimation impact levels	Heuristic rules
Supplier system requirement derivation	Heuristic classification + deterministic clustering
Cybersecurity concept conclusions	Templated from aggregated data

Which outputs should be AI-assisted for better quality

Requirement interpretation (intent, ambiguity, implicit obligations).
Supplier proposal generation (bespoke, requirement-specific wording).
Customer clarification question generation (sharper, context-aware).
Supplier system requirement derivation (better clustering + statements).
Cybersecurity concept conclusion generation (assumptions, risks, gaps).

What structured prompts / schema are needed

Defined in docs/ai_analysis_stage_design.md, with prompts in prompts/ and JSON schemas in schemas/. The AI stage must return structured JSON validated against those schemas — never free text.

Recommendation

AI/LLM-assisted analysis should be added as an optional, controlled stage for: requirement interpretation, supplier proposal generation, customer clarification generation, system requirement derivation, and cybersecurity concept conclusions. It must read Markdown-derived requirements only, never analyse PDFs directly, return schema-validated JSON, separate explicit evidence from inference, and mark confidence. Until implemented, conclusions are only as good as the current rules/heuristics plus human review.

Optional AI Stage Design

Optional AI Analysis Stage — Design

Design only. No API calls are implemented. This defines an optional, controlled AI/LLM stage that augments the rule/heuristic pipeline without replacing its deterministic, auditable core.

Position in the Pipeline

Markdown-derived requirements
        │
        ▼
  (optional) AI Analysis Stage  ──►  structured JSON (schema-validated)
        │                               │
        ▼                               ▼
  merge into review register / SSRs / concept  ──►  site + validation

The AI stage runs after requirement extraction and before site generation. Its JSON output is merged as a suggestion layer; deterministic assembly, traceability and validation are unchanged. If the AI stage is disabled, the rule/heuristic defaults are used (current behaviour).

Hard Rules

Read Markdown-derived requirements only; never analyze PDFs directly.
Return structured JSON validated against the schemas in schemas/; never free text.
Separate explicit evidence from inference (every field tags evidence vs inferred).
Mark confidence (high / medium / low) on every judgment.
Preserve source traceability (source_markdown, page_reference) on every item.
Output is a suggestion: a human reviews before it becomes a committed position.
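A few of these hard rules can be checked with plain Python before any schema validation runs. The suggestion layout below (`fields`, `basis`, `value`) is an assumption for illustration; the authoritative structure is whatever the files in schemas/ define.

```python
ALLOWED_CONFIDENCE = {"high", "medium", "low"}
ALLOWED_BASIS = {"evidence", "inferred"}


def hard_rule_violations(suggestion):
    """Return a list of rule violations for one AI suggestion (a dict)."""
    problems = []
    for key in ("source_markdown", "page_reference"):
        if not suggestion.get(key):
            problems.append(f"missing traceability field: {key}")
    if str(suggestion.get("source_markdown", "")).lower().endswith(".pdf"):
        problems.append("source must be Markdown, not a PDF")
    if suggestion.get("confidence") not in ALLOWED_CONFIDENCE:
        problems.append("confidence must be high, medium or low")
    for name, field in suggestion.get("fields", {}).items():
        if field.get("basis") not in ALLOWED_BASIS:
            problems.append(f"field '{name}' is not tagged evidence or inferred")
    return problems


# Invented sample suggestion, for illustration only.
good = {
    "source_markdown": "converted/markdown/spec.md",
    "page_reference": 14,
    "confidence": "medium",
    "fields": {"supplier_position": {"value": "Accept", "basis": "inferred"}},
}
print(hard_rule_violations(good))  # []
```

An empty list means the suggestion may proceed to human review; it does not make the suggestion a committed position.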

Agents and Outputs

Agent	Prompt	Schema	Produces
Requirement Review	`prompts/requirement_review_agent.md`	`schemas/requirement_review.schema.json`	supplier proposal + position + open points per requirement
System Requirement Derivation	`prompts/system_requirement_derivation_agent.md`	`schemas/system_requirement_derivation.schema.json`	SSR candidates with many-to-many traceability
Cybersecurity Concept	`prompts/cybersecurity_concept_agent.md`	`schemas/cybersecurity_concept.schema.json`	concept conclusions, assumptions, risks, gaps

Controls

Batching: process requirements in capped batches; deterministic IDs.
Determinism: temperature 0; cache by input hash so re-runs are stable.
Validation: reject any non-conforming JSON; fall back to heuristic defaults.
Cost/scope guard: AI stage is opt-in via config flag (default off).
No silent overwrite: AI output lands in a *_ai_suggested field; merge is explicit.
Supplier system requirement control: generate supplier system requirement candidates only from Markdown-derived customer requirements, with explicit evidence, inference, confidence, and open points.
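Two of these controls, caching by input hash and the no-silent-overwrite rule, can be sketched together. The function names are hypothetical, and using SHA-256 over sorted-key JSON is one possible way to get a stable hash, not necessarily the designed one.

```python
import hashlib
import json


def input_hash(requirement):
    """Stable cache key: SHA-256 of the requirement serialised with sorted keys."""
    canonical = json.dumps(requirement, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def attach_suggestion(row, field, suggestion, is_valid):
    """Store AI output in <field>_ai_suggested; never touch the heuristic value.

    If the output failed validation, the row is returned unchanged, so the
    heuristic default remains the only value (the documented fallback).
    """
    updated = dict(row)
    if is_valid:
        updated[f"{field}_ai_suggested"] = suggestion
    return updated


row = {"id": "REQ-1", "supplier_position": "Accept"}
merged = attach_suggestion(row, "supplier_position", "Partial", is_valid=True)
print(merged["supplier_position"], merged["supplier_position_ai_suggested"])
# Accept Partial
```

Sorting the keys before hashing means two dicts with the same content produce the same cache key regardless of key order, which keeps re-runs stable.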

Inputs / Outputs Summary

Input: requirements/requirement_review_register.json (or extracted requirements), relevant Markdown excerpts.
Output: JSON files under an ai_suggestions/ folder, schema-validated, then merged on human approval.