WAF skills¶
Four skills cover the connector lifecycle for WAF sources. Each carries a WAF-specific reference. The procedural body of each skill is at Connector skills.
analyze-source: WAF reference¶
Facts the analyze-source skill needs to write a complete Reference section for a WAF source. WAF sources project edge-event records into finding-shape rows on silver.findings per the trufflehog convention — severity is derived from action, status is the literal open, and finding_id is a deterministic SHA-256 hash so re-deliveries collapse at MERGE.
Applicable REQ-IDs¶
From mkdocs/docs/platform/reference/catalog.md. WAF sources emit edge-event records that are projected into finding-shape rows on silver.findings.
- Apply: REQ-ING-AUTH, REQ-ING-HWM, REQ-TRF-MAP, REQ-TRF-SEV, REQ-TRF-TS, REQ-DQ.
- REQ-ING-PAG and REQ-ING-RL apply only when the connector consumes a paginated SDK surface (for example sampled-request SDK calls); for log-stream consumption (the preferred mode), these are N/A.
- REQ-TRF-STS applies in degraded form — WAF events have no native lifecycle, so the connector emits the literal open per the trufflehog convention; status_canonical never transitions. The catalog matrix marks this N/A (matching the trufflehog row).
- REQ-DEDUP stays N/A on the catalog matrix — WAF events still don't share dedup tuples with SAST/SCA/secrets/DAST. Replay deduplication is achieved via the deterministic finding_id hash plus the Bronze→Silver MERGE; no dedup_links rows are emitted.
Default severity¶
medium. Severity is not a first-class field on a WAF event; the canonical severity is derived from the action field via an action-keyed lookup table (e.g. block→high, count→medium, allow→low) — analogous to the secrets convention but data-driven from action rather than fixed.
The Reference section's Enumerations fact MUST disclose the derivation rule (action → canonical severity) and list every documented action value.
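As a hedged sketch of that derivation rule (the lookup values and the warning mechanism here are illustrative — the canonical values live in the connector's severity config, not in code):

```python
import logging

# Illustrative action-keyed lookup; canonical values belong in
# config/severity/{source}.yml, not in code.
ACTION_SEVERITY = {
    "block": "high",
    "count": "medium",
    "allow": "low",
    "challenge": "low",
    "captcha": "low",
}

def derive_severity(action: str, default: str = "medium") -> str:
    """Map a WAF action to canonical severity; unmatched actions fall
    through to the configurable default with a data-quality warning."""
    if action not in ACTION_SEVERITY:
        logging.warning("unmapped WAF action %r; defaulting to %s", action, default)
        return default
    return ACTION_SEVERITY[action]
```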
Incremental strategy¶
Timestamp-based high-water mark over the log stream. The connector records the last event-time ingested per WebACL or rule group and advances the window forward on each run.
For AWS deployments the reference pattern is Firehose to S3 or CloudWatch Logs into Bronze via an autoloader-style ingestion. For on-prem appliances the same pattern applies over the forwarded syslog bucket. Sampled SDK calls (for example GetSampledRequests) are supported as a fallback only — the WAF specification requires the connector to PREFER log-stream consumption over sampled SDK calls because samples lose fidelity under high-volume rules.
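A minimal sketch of the per-WebACL watermark bookkeeping described above (the data structure and function name are illustrative; a real connector persists the watermark across runs rather than holding it in memory):

```python
from typing import Dict

def advance_hwm(hwm: Dict[str, int], webacl_arn: str, event_time_ms: int) -> bool:
    """Per-WebACL timestamp high-water mark: return True if the event is
    newer than the recorded watermark (and should be ingested), advancing
    the watermark as a side effect; False if it was already covered."""
    last = hwm.get(webacl_arn, 0)
    if event_time_ms <= last:
        return False  # at or behind the watermark: already ingested
    hwm[webacl_arn] = event_time_ms
    return True
```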
Deduplication key¶
REQ-DEDUP is N/A on the catalog matrix — WAF rows do not share dedup tuples with SAST/SCA/secrets/DAST findings, so no dedup_links rows are emitted. Replay deduplication (recovering from re-delivered events) is achieved instead by the deterministic finding_id SHA-256 hash of (webacl_arn, request_id, timestamp_ms) plus the Bronze→Silver MERGE — re-delivered events collapse onto the same finding_id at MERGE time.
Target Silver tables¶
silver.findings — the canonical findings table, same target as SAST/SCA/secrets/DAST. WAF events are projected into finding-shape rows: each event becomes one finding row with severity derived from action (via the action-keyed lookup), status set to the literal open (no native lifecycle), and a deterministic finding_id SHA-256 hashed from (webacl_arn, request_id, timestamp_ms).
WAF events have no native repository_id, so repository_id is null on the emitted rows. Gold-side aggregations bucket WAF findings under the __UNMAPPED__ application sentinel until an operator extends silver.app_repo_mapping with a webacl_arn → application_id mapping (out of scope for the MVP).
WAF-specific telemetry that is NOT carried on silver.findings — source_ip, country, http_method, response_code, sampling_weight, rule_type, and the action value itself — is intentionally dropped from the canonical record. Operators query upstream WAF logs (S3 / CloudWatch) for that telemetry. The headline schema deviation that previously distinguished WAF (a dedicated silver.waf_events table) has been collapsed.
Authentication norms¶
Account-scoped, NOT per-tenant. Cloud-native WAFs (AWS WAF, Cloudflare, Azure Front Door WAF) authenticate via IAM role or access key bound to the cloud account hosting the WebACLs. On-prem appliances (F5 ASM, Imperva, ModSecurity) authenticate via a service credential bound to the log-aggregation tier. There is no per-application authentication axis.
Ingestion-tooling preference¶
Standard preference order applies: Lakeflow Connect > Databricks SDK > dlt. For AWS WAF, autoloader-style ingestion from the Firehose-to-S3 prefix is the canonical pattern — this fits the Lakeflow Connect / SDK envelope. SDK-based sampled-request fallback is permitted only when full-log ingestion is not yet provisioned, with the statistical sampling weight preserved into Bronze for downstream extrapolation.
Quirks¶
- Finding-shape on silver.findings. WAF now follows the trufflehog convention: each WAF event becomes one finding row on silver.findings (the same canonical table SAST/SCA/secrets/DAST target). The previous schema deviation (a dedicated silver.waf_events table) has been collapsed.
- Severity is derived. Severity comes from the action field via an action-keyed lookup (block→high, count→medium, allow→low, etc.). The lookup is action-keyed, not severity-keyed.
- Status is the literal open. WAF events have no native lifecycle; the connector follows the trufflehog convention and writes the literal open to status_canonical. status_canonical never transitions.
- Deterministic finding_id. Each row's finding_id is a deterministic SHA-256 hash of (webacl_arn, request_id, timestamp_ms). Re-deliveries collapse at MERGE time.
- WAF-only telemetry is dropped. source_ip, country, http_method, response_code, sampling_weight, rule_type, and the action value itself are NOT carried on silver.findings. Operators query upstream WAF logs (S3 / CloudWatch) for that detail.
- Sampling weight. Where the source returns statistical samples (sampled SDK calls), each record carries a sampling weight that MUST be preserved into Bronze for downstream extrapolation (it is not projected onto silver.findings).
- Application linkage is deferred. WAF events have no native repository_id; repository_id is null on emitted rows. Gold-side aggregations bucket WAF findings under the __UNMAPPED__ application sentinel until an operator extends silver.app_repo_mapping with a webacl_arn → application_id mapping (out of scope for the MVP).
- Action vocabulary. Documented actions include block, allow, count, challenge, captcha. The Reference section MUST list every action the source emits — this drives the severity-derivation lookup.
- Log-stream over SDK. Prefer log-stream consumption over sampled SDK calls; the Reference section's Quirks fact MUST disclose the chosen mode and justify any deviation.
Rendered from .claude/skills/analyze-source/references/waf.md. Source of truth lives in the skill file.
provision-source: WAF reference¶
Facts the provision-source skill needs to emit the source-side runtime for a WAF source. WAF connectors follow a bucket-policy-only runtime shape (canonical follower: AWS WAF). The operator provisions the WebACL, the Kinesis Firehose delivery stream, and the destination S3 bucket out of band; the runtime wires those external resources into the connector by attaching the Firehose-write bucket policy and surfacing the bronze schema name and bucket ARN as outputs.
Runtime shape¶
runtime_provisioner: terraform-aws-bucket-policy. Provider stack: hashicorp/aws + databricks/databricks (the latter for versions.tf parity, currently unused — kept for forward-compatibility if a future revision adds UC Volume / external location bindings).
Resources / data sources:
- data "aws_s3_bucket" "waf_logs" — references the operator-supplied bucket (does NOT create it). Bucket name is parsed out of the ARN via element(split(":::", var.aws_waf_log_bucket_arn), 1).
- aws_s3_bucket_policy.waf_logs_firehose — bucket policy granting the Firehose service principal (firehose.amazonaws.com) s3:PutObject + s3:PutObjectAcl on ${var.aws_waf_log_bucket_arn}/*, conditioned on aws:SourceAccount = var.aws_waf_account_id. Sid AllowFirehoseWrite.
It does not create the WebACL, the Firehose delivery stream, or the S3 bucket. Those are operator prerequisites.
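A minimal Terraform sketch of that shape — the resource and variable names match the reference above, but the policy JSON layout is illustrative, not the module's verbatim body:

```hcl
data "aws_s3_bucket" "waf_logs" {
  # Operator-supplied bucket; name parsed out of the ARN (NOT created here).
  bucket = element(split(":::", var.aws_waf_log_bucket_arn), 1)
}

resource "aws_s3_bucket_policy" "waf_logs_firehose" {
  bucket = data.aws_s3_bucket.waf_logs.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Sid       = "AllowFirehoseWrite"
      Effect    = "Allow"
      Principal = { Service = "firehose.amazonaws.com" }
      Action    = ["s3:PutObject", "s3:PutObjectAcl"]
      Resource  = "${var.aws_waf_log_bucket_arn}/*"
      Condition = {
        StringEquals = { "aws:SourceAccount" = var.aws_waf_account_id }
      }
    }]
  })
}
```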
operational.yml.source_runtime fields¶
Required: runtime_provisioner (always terraform-aws-bucket-policy for WAF), catalog_var_name, bronze_schema_name (default bronze_aws_waf), aws_region_var_name, aws_account_id_var_name, log_bucket_arn_var_name. Optional with category defaults: aws_region_default (us-east-1), firehose_service_principal (firehose.amazonaws.com), firehose_actions (["s3:PutObject", "s3:PutObjectAcl"]), bucket_policy_sid (AllowFirehoseWrite), secret_keys_external (["waf_log_bucket", "aws_waf_iam_role_arn"] — loaded by scripts/load-secrets.sh, NOT by Terraform), sample_artefact_path (runtime/files/sample.json), terraform_required_version (>= 1.5).
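An illustrative operational.yml fragment assembling those fields — the values shown are the defaults and var names stated above; the exact key layout is a sketch, not a prescriptive schema:

```yaml
source_runtime:
  runtime_provisioner: terraform-aws-bucket-policy
  catalog_var_name: catalog
  bronze_schema_name: bronze_aws_waf
  aws_region_var_name: aws_region
  aws_account_id_var_name: aws_waf_account_id
  log_bucket_arn_var_name: aws_waf_log_bucket_arn
  # Optional fields below fall back to these category defaults.
  aws_region_default: us-east-1
  firehose_service_principal: firehose.amazonaws.com
  firehose_actions: ["s3:PutObject", "s3:PutObjectAcl"]
  bucket_policy_sid: AllowFirehoseWrite
  secret_keys_external: ["waf_log_bucket", "aws_waf_iam_role_arn"]  # loaded by scripts/load-secrets.sh, NOT Terraform
  sample_artefact_path: runtime/files/sample.json
  terraform_required_version: ">= 1.5"
```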
Variables exposed¶
Required: catalog, aws_waf_account_id, aws_waf_log_bucket_arn. Optional: aws_region (default us-east-1).
Outputs¶
bronze_schema_full_name (= ${var.catalog}.bronze_aws_waf), s3_bucket_arn (echo of the operator-supplied bucket ARN).
Operator-authored sidecar¶
One runtime/files/* reference: runtime/files/sample.json — a sanitised representative WAFv2 log record. Each S3 object delivered by Firehose contains one or more records in this form, separated by newlines (typically gzipped). The bronze envelope (sql/event_envelope.sql) lands the raw payload as a string and extracts the WebACL ID at ingest time for joinability. Operator-authored — the skill emits the README reference but never the file body.
runtime/install.sh shape¶
terraform init + terraform apply -auto-approve wrapper, with TF_VAR exports for CATALOG, AWS_WAF_ACCOUNT_ID, AWS_WAF_LOG_BUCKET_ARN (e.g. arn:aws:s3:::my-org-waf-logs). Optional override: AWS_REGION.
Prerequisites: WAFv2 enabled in $AWS_WAF_ACCOUNT_ID, fronting CloudFront, ALB, or API Gateway; WebACL configured with logging enabled, sending logs via Kinesis Firehose to the target S3 bucket; the target bucket exists and is owned by the operator (in the same account as the Firehose); AWS credentials usable from Terraform with permissions to attach an S3 bucket policy on the target bucket; for runtime ingestion, AWS credentials with s3:GetObject on the log bucket loaded into the Databricks mvp-connectors scope via bash scripts/load-secrets.sh.
Page §Source provisioning section template¶
Inserted after ## User inputs and before ## Secrets. Section heading: ## Optional source runtime. Body explains that the module wires the operator-owned S3 bucket that Kinesis Firehose delivers WAFv2 log records to into the connector — the runtime does not create the WebACL, Firehose, or bucket; what it does create is the S3 bucket policy granting the Firehose service principal write access, scoped via aws:SourceAccount. Documents the apply command (one-liner against catalog, aws_waf_account_id, aws_waf_log_bucket_arn), with a cross-link to runtime/files/sample.json for the log-record format reference.
Secrets-out-of-Terraform note (carried into the page): secret values for the WAF connector (waf_log_bucket, aws_waf_iam_role_arn) live in the Databricks mvp-connectors scope and are loaded by scripts/load-secrets.sh. They do NOT flow through this Terraform module — main.tf only manages the S3 bucket policy. Keeping secret values out of Terraform state is intentional.
Teardown caveat¶
terraform destroy removes the bucket policy only. The underlying bucket and any log objects already delivered to it are not managed by this module. Delete them out of band if no longer needed. The WebACL and Firehose delivery stream are also not managed by this module.
Rendered from .claude/skills/provision-source/references/waf.md. Source of truth lives in the skill file.
generate-connector: WAF reference¶
Facts the generate-connector skill needs to emit a WAF connector module. WAF sources project edge-event records into finding-shape rows on silver.findings per the trufflehog convention — severity is derived from action, status is the literal open, and finding_id is a deterministic SHA-256 hash so re-deliveries collapse at MERGE.
Applicable REQ-IDs¶
From mkdocs/docs/platform/reference/catalog.md. Bind one test function per REQ-ID below.
- Bind: REQ-ING-AUTH, REQ-ING-HWM, REQ-TRF-MAP, REQ-TRF-SEV, REQ-TRF-TS, REQ-DQ.
- REQ-ING-PAG and REQ-ING-RL apply only when the connector consumes a paginated SDK surface (e.g. GetSampledRequests); for log-stream consumption (the preferred mode), they are N/A.
- REQ-TRF-STS applies in degraded form — bind a test asserting that status_canonical is the literal open (per the trufflehog convention) and never transitions, since WAF events have no native lifecycle. The catalog matrix marks this N/A (matching the trufflehog row's pattern for literal-status sources).
- REQ-DEDUP stays N/A — WAF rows do not share dedup tuples with SAST/SCA/secrets/DAST findings, so the connector emits no dedup_links rows. Replay deduplication is achieved via the deterministic finding_id hash plus the Bronze→Silver MERGE; bind a single test asserting that re-delivered events collapse onto the same finding_id.
Default severity¶
medium. Severity is derived, not source-supplied — there is no severity field on a WAF event. The canonical severity is computed from the action field (block / allow / count / challenge / captcha) via an action-keyed lookup.
The config/severity/{source}.yml lookup is therefore action-keyed, not severity-keyed. Generate the lookup with action-to-severity mappings covering every documented action value (e.g. block: high, count: medium, allow: low, challenge: low, captcha: low) and have the mapping.yml severity field reference the lookup with action as the source path.
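A hedged sketch of those two fragments — the file body and the mapping.yml key shapes are illustrative, only the action values and the action-as-source-path relationship come from the reference above:

```yaml
# config/severity/waf.yml — action-keyed lookup (illustrative body)
block: high
count: medium
allow: low
challenge: low
captcha: low

# mapping.yml — severity references the lookup with action as the
# source path (fragment shape is illustrative)
severity:
  source: action
  lookup: config/severity/waf.yml
  default: medium   # unmatched actions; emits a data-quality warning
```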
The configurable default for unmatched actions is medium with a data-quality warning.
Incremental strategy¶
Timestamp-based HWM over the log stream. Encode in config.yml:
- The connector records the last event-time ingested per WebACL or rule group and advances forward on each run.
- For AWS deployments: autoloader-style ingestion from a Firehose-to-S3 prefix or CloudWatch Logs export.
- For on-prem appliances: the same pattern over the forwarded syslog bucket.
- Sampled SDK calls (GetSampledRequests) are a fallback only. Prefer log-stream consumption — samples lose fidelity under high-volume rules. Where the fallback is used, preserve the statistical sampling weight (Weight field) into Bronze for downstream extrapolation (not onto silver.findings).
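The bullets above might be encoded roughly as follows — the key names are illustrative, not a fixed config.yml schema:

```yaml
# config.yml — incremental-strategy fragment (illustrative key names)
incremental:
  strategy: timestamp_hwm
  hwm_scope: webacl          # or rule_group
  hwm_field: timestamp_ms
ingestion:
  mode: log_stream           # preferred; sdk_sampled is fallback-only
  aws:
    source: firehose_s3      # autoloader-style, Firehose-to-S3 prefix
  on_prem:
    source: syslog_bucket    # same pattern over the forwarded syslog bucket
  fallback:
    api: GetSampledRequests
    preserve_weight: true    # Weight lands in Bronze, never on silver.findings
```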
Deduplication key¶
Per canonical mapping: REQ-DEDUP is N/A for WAF — WAF rows do not share dedup tuples with SAST/SCA/secrets/DAST findings, so the canonical dedup_links table does not get WAF rows. Cite mkdocs/docs/platform/reference/canonical-mapping.md in a transform-level comment to document the absence.
Replay deduplication (within-source, recovering from re-delivered events) is achieved by the deterministic finding_id: a SHA-256 hash of (webacl_arn, request_id, timestamp_ms) projected onto each row. Re-delivered events produce the same finding_id and collapse at the Bronze→Silver MERGE:
import hashlib

def derive_finding_id(webacl_arn: str, request_id: str, timestamp_ms: int) -> str:
    """Deterministic SHA-256 finding_id; re-delivered WAF events collapse at MERGE."""
    payload = f"{webacl_arn}|{request_id}|{timestamp_ms}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()
Do NOT emit dedup_links rows — the canonical dedup_links table targets cross-tool finding overlap, which WAF does not participate in.
Target Silver tables¶
silver.findings — the canonical findings table. WAF follows the trufflehog convention: each WAF event becomes one finding row on silver.findings, severity is derived from action via the action-keyed lookup, status is the literal open (no native lifecycle), and finding_id is the deterministic SHA-256 hash described above. The mapping.yml block targets silver.findings with the canonical envelope columns populated; cwe_id, cve_id, repository_id, file_path, and start_line are null. The previous schema deviation (a dedicated silver.waf_events table) has been collapsed; WAF now matches every other category's silver.findings target.
WAF events have no native repository_id; the connector emits null repository_id on every row. Gold-side aggregations bucket WAF findings under the __UNMAPPED__ application sentinel until an operator extends silver.app_repo_mapping with a webacl_arn → application_id mapping (out of scope for the MVP). Do NOT emit a transform-time join against silver.deployments; do NOT encode an ARN-to-application resolution in transform.py.
WAF-specific telemetry that is NOT carried on silver.findings — source_ip, country, http_method, response_code, sampling_weight, rule_type, and the action value itself — is intentionally dropped from the canonical record. Operators query upstream WAF logs (S3 / CloudWatch) for that detail; do not project these fields into silver.findings.
Authentication norms¶
Account-scoped, NOT per-tenant:
- Cloud-native WAFs (AWS WAF, Cloudflare, Azure Front Door WAF): IAM role or access key bound to the cloud account hosting the WebACLs.
- On-prem appliances (F5 ASM, Imperva, ModSecurity): service credential bound to the log-aggregation tier.
ingest.py reads credentials via the helper in src/common/; config.yml references the secret-scope key names. There is no per-application authentication axis; do not generate one.
Ingestion-tooling preference¶
Standard order: Lakeflow Connect → Databricks SDK → dlt.
- AWS WAF: autoloader-style ingestion from the Firehose-to-S3 prefix is the canonical pattern. This fits the Lakeflow Connect / SDK envelope.
- SDK-based sampled-request fallback: permitted only when full-log ingestion is not yet provisioned. Preserve the sampling weight into Bronze.
- The artefact-collection / autoloader pattern is the dominant WAF mode and aligns with Lakeflow Connect; no CLI-artefact deviation is needed.
Quirks¶
- Finding-shape on silver.findings. Each WAF event projects to one finding row on the canonical findings table — same target as SAST/SCA/secrets/DAST. Emit a finding-only block in mapping.yml; reuse the canonical envelope columns. The previous silver.waf_events schema deviation has been collapsed.
- Severity is derived. The action field drives canonical severity through the action-keyed lookup. Generate the lookup as action-keyed; do NOT generate a severity-keyed lookup that mirrors a source severity field (there is none).
- Status is the literal open. Per the trufflehog convention, write open to status_canonical; the field never transitions. The config/status/{source}.yml lookup contains a comment to that effect.
- Deterministic finding_id. Project a SHA-256 hash of (webacl_arn, request_id, timestamp_ms) as the finding_id. Re-delivered events collapse at the Bronze→Silver MERGE.
- WAF-only telemetry is dropped. Do NOT project source_ip, country, http_method, response_code, sampling_weight, rule_type, or the action value itself onto silver.findings. Operators query upstream WAF logs (S3 / CloudWatch) for that telemetry.
- Sampling weight preserved in Bronze only. Where the source returns statistical samples, project the Weight field into Bronze (not onto silver.findings). Downstream extrapolation depends on it.
- Application linkage is deferred. WAF events have no native repository_id; emit repository_id = null. Gold-side aggregations bucket WAF findings under the __UNMAPPED__ application sentinel until an operator extends silver.app_repo_mapping with a webacl_arn → application_id mapping (out of scope for the MVP). Do NOT emit a silver.deployments join in transform.py.
- Action vocabulary. Documented actions include block, allow, count, challenge, captcha. The severity lookup MUST cover every action the source emits — exhaustive over the documented vocabulary.
- Log-stream over SDK. Prefer log-stream consumption. SDK sampled-request mode is fallback-only; document the deviation in a top-of-file comment in ingest.py if used.
Databricks-side production-shape¶
In addition to the eight-file core, generate-connector emits the Databricks-side production-shape for WAF connectors. The skill reads operational.yml.databricks_runtime to interpolate the templates.
The WAF databricks_runtime schema (reverse-engineered from the AWS WAF follower) covers seventeen fields: secret_scope, bronze_schema, bronze_tables, envelope_table (default event_envelope), cron_schedule (default 0 */15 * * * ? — every 15 min), uc_catalog_var, job_name (kebab-case, e.g. aws-waf-connector), default_target, default_catalog, secret_env_vars (e.g. WAF_LOG_BUCKET → waf_log_bucket, AWS_WAF_IAM_ROLE_ARN → aws_waf_iam_role_arn), extra_install_env_vars (typically required: AWS_WAF_ACCOUNT_ID, AWS_WAF_LOG_BUCKET_ARN, AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY), tool_source_label (discriminates WAF rows from other tools' findings on the canonical silver.findings table; the verify-step queries WHERE tool_source = '{tool_source_label}'), entry_wrappers (false — ingest runs in-notebook from ingest.py), ingestion_mode (log_stream or sdk_sampled; default log_stream), log_stream_prefix (default waf/firehose/), firehose_account_id_env (default AWS_WAF_ACCOUNT_ID), webacl_log_bucket_arn_env (default AWS_WAF_LOG_BUCKET_ARN).
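An illustrative operational.yml fragment for that schema — the values shown are the defaults and examples stated above; tool_source_label is a hypothetical value, and bronze_tables, uc_catalog_var, default_target, and default_catalog are omitted for brevity:

```yaml
databricks_runtime:
  secret_scope: mvp-connectors
  bronze_schema: bronze_aws_waf
  envelope_table: event_envelope
  cron_schedule: "0 */15 * * * ?"        # every 15 min
  job_name: aws-waf-connector            # kebab-case
  secret_env_vars:
    WAF_LOG_BUCKET: waf_log_bucket
    AWS_WAF_IAM_ROLE_ARN: aws_waf_iam_role_arn
  extra_install_env_vars:
    - AWS_WAF_ACCOUNT_ID
    - AWS_WAF_LOG_BUCKET_ARN
    - AWS_ACCESS_KEY_ID
    - AWS_SECRET_ACCESS_KEY
  tool_source_label: aws-waf             # hypothetical; drives the verify-step WHERE clause
  entry_wrappers: false                  # ingest runs in-notebook from ingest.py
  ingestion_mode: log_stream             # or sdk_sampled
  log_stream_prefix: waf/firehose/
  firehose_account_id_env: AWS_WAF_ACCOUNT_ID
  webacl_log_bucket_arn_env: AWS_WAF_LOG_BUCKET_ARN
```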
What the production-shape adds on top of the eight-file core:
- scripts/load-secrets.sh — populates the secret scope from databricks_runtime.secret_env_vars. Iterates over the env-var/secret-key pairs and runs databricks secrets put-secret per pair.
- scripts/install.sh — streamlined three-step shape (load-secrets → databricks bundle run {job_name} → echo verify). The runbook-grade verify-counts-via-SQL-warehouse flow lives in the docs page, not the install script.
- Top-level install.sh — orchestrator chaining runtime/install.sh → scripts/load-secrets.sh → databricks bundle deploy. WAF source-side runtime is mandatory for the log-stream path — runtime/install.sh attaches the Firehose-write S3 bucket policy, without which the autoloader has nothing to read.
- sql/<envelope>.sql — REQUIRED. CREATE TABLE shape (companion to the autoloader-managed bronze table). Autoloader reads gzipped JSON files from the S3 bucket and lands them with columns raw_payload, webacl_id (extracted at ingest from terminatingRuleArn or webaclId for joinability), ingested_at, run_id. The transform projects this into silver.findings.
- No *_entry.py wrappers — entry_wrappers = false. The resources/job.yml notebook_path points at ../ingest.py directly.
- resources/ extras — alongside resources/{source}-job.yml (15-min cron, with an extra ingestion_mode job parameter, log_stream | sdk_sampled), WAF emits resources/schemas.yml (bronze only). resources/connection.yml is N/A — the workspace AWS service credential reads S3 directly; no UC connection. resources/pipeline.yml is N/A — notebook job, not Lakeflow Connect. resources/volumes.yml is N/A — the workspace AWS service credential reads from the S3 prefix directly; AWS WAF does not emit a UC Volume.
- Connector page §4–§7 templates — §Secrets (table mapping secret_key ↔ env_var with the workspace-AWS-service-credential note), §Run the job (notebook job named {job_name} with the Firehose-buffer-flush callout — Firehose buffers up to 5 minutes or 5 MiB, so smoke tests should generate blockable requests then wait ~5 minutes before the autoloader picks them up on the next 15-min tick), §Verify (Bronze count plus top-terminating-rules and severity-distribution aggregations against silver.findings filtered by tool_source = '{tool_source_label}'), and §Troubleshooting (no-records-after-buffer-flush, severity_canonical mis-derived from action, autoloader-prefix mismatch).
Rendered from .claude/skills/generate-connector/references/waf.md. Source of truth lives in the skill file.
validate-implementation: WAF reference¶
Facts the validate-implementation skill needs to populate the Validation table for a WAF connector. WAF sources project edge-event records into finding-shape rows on silver.findings per the trufflehog convention — severity is derived from action via the action-keyed lookup, status is the literal open, and finding_id is a deterministic SHA-256 hash so re-deliveries collapse at MERGE.
Applicable REQ-IDs¶
From mkdocs/docs/platform/reference/catalog.md § "Requirement catalog". The AWS WAF column of the traceability matrix is the MVP-built profile.
Apply (the test suite MUST have a @pytest.mark.requirement("REQ-...")-bound test for each):
- REQ-ING-AUTH
- REQ-ING-HWM
- REQ-TRF-MAP
- REQ-TRF-SEV
- REQ-TRF-STS — degraded form: asserts status_canonical is the literal open on every emitted row and never transitions, per the trufflehog convention. The catalog matrix marks this N/A (matching the trufflehog row's pattern for literal-status sources).
- REQ-TRF-TS
- REQ-DQ — also covers the deterministic-finding_id replay-deduplication assertion (re-delivered events collapse onto the same finding_id at MERGE).
Mark N/A:
- REQ-ING-PAG, N/A: log-stream consumption (the preferred mode) has no API pagination. Apply only when the connector consumes a paginated SDK surface (e.g. the GetSampledRequests fallback); otherwise mark N/A with the rationale "log-stream mode has no API pagination".
- REQ-ING-RL, N/A: same rationale; log-stream consumption has no API rate limit. Apply only when the SDK fallback is in use.
- REQ-DEDUP, N/A on the catalog matrix: WAF rows do not share dedup tuples with SAST/SCA/secrets/DAST findings, so the connector emits no dedup_links rows. Replay deduplication is achieved instead by the deterministic finding_id SHA-256 hash of (webacl_arn, request_id, timestamp_ms) plus the Bronze→Silver MERGE — re-delivered events collapse onto the same finding_id at MERGE time. That replay assertion is bound under REQ-DQ, not REQ-DEDUP.
Default severity¶
medium configurable default; severity is derived from the action field via an action-keyed lookup, not source-supplied. The REQ-TRF-SEV test asserts the action-keyed lookup covers every documented action (block, allow, count, challenge, captcha) and that undocumented actions fall through to medium with a data-quality warning.
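A hedged sketch of that REQ-TRF-SEV binding — the function names are illustrative, and load_action_lookup stands in for parsing the real config/severity/{source}.yml:

```python
import pytest

DOCUMENTED_ACTIONS = {"block", "allow", "count", "challenge", "captcha"}

def load_action_lookup():
    # Stand-in for parsing config/severity/{source}.yml; values illustrative.
    return {"block": "high", "count": "medium", "allow": "low",
            "challenge": "low", "captcha": "low"}

@pytest.mark.requirement("REQ-TRF-SEV")
def test_action_lookup_covers_documented_vocabulary():
    lookup = load_action_lookup()
    # Every documented action must be covered by the action-keyed lookup.
    assert DOCUMENTED_ACTIONS <= lookup.keys()

@pytest.mark.requirement("REQ-TRF-SEV")
def test_undocumented_action_falls_through_to_default():
    lookup = load_action_lookup()
    # Unmatched actions fall through to medium (with a data-quality warning).
    assert lookup.get("not-a-documented-action", "medium") == "medium"
```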
Incremental strategy¶
Timestamp-based HWM over the log stream. The connector records the last event-time ingested per WebACL or rule group and advances forward each run. The test suite asserts HWM-resume behaviour under REQ-ING-HWM against the timestamp advancement; SDK-fallback mode preserves Weight into Bronze and is also asserted under REQ-ING-HWM plus REQ-TRF-MAP.
Deduplication key¶
Per mkdocs/docs/platform/reference/canonical-mapping.md: REQ-DEDUP is N/A for WAF — WAF rows do not share dedup tuples with SAST/SCA/secrets/DAST findings, so no dedup_links rows are emitted. Replay deduplication is instead achieved via the deterministic finding_id SHA-256 hash of (webacl_arn, request_id, timestamp_ms) plus the Bronze→Silver MERGE; re-delivered events collapse onto the same finding_id. The test suite asserts this collapse under REQ-DQ, NOT under REQ-DEDUP. The test suite does NOT emit dedup_links rows for WAF.
Target Silver tables¶
silver.findings — the canonical findings table, same target as SAST/SCA/secrets/DAST. The REQ-TRF-MAP test asserts the connector targets silver.findings with the canonical envelope columns populated; cwe_id, cve_id, repository_id, file_path, and start_line are null on every emitted row. The headline schema deviation that previously distinguished WAF (a dedicated silver.waf_events table) has been collapsed; WAF now matches every other category's silver.findings target.
The REQ-TRF-MAP test additionally asserts:
- severity_canonical is derived from action via the action-keyed lookup at config/severity/{source}.yml.
- status_canonical is the literal open on every row (asserted under REQ-TRF-STS in degraded form).
- finding_id is a deterministic SHA-256 hash of (webacl_arn, request_id, timestamp_ms), so re-deliveries collapse at MERGE.
- repository_id is null on every row (WAF events have no native repository linkage).
- WAF-specific telemetry (source_ip, country, http_method, response_code, sampling_weight, rule_type, the action value itself) is NOT projected onto silver.findings. Operators query upstream WAF logs (S3 / CloudWatch) for that detail.
- No transform-time join against silver.deployments is emitted — application linkage is deferred to Gold-side aggregations, which bucket WAF findings under the __UNMAPPED__ application sentinel until an operator extends silver.app_repo_mapping with a webacl_arn → application_id mapping (out of scope for the MVP).
Authentication norms¶
Account-scoped, NOT per-tenant. Cloud-native WAFs use IAM role or access key; on-prem appliances use a service credential bound to the log-aggregation tier. The test suite asserts credential resolution from the platform secret scope under REQ-ING-AUTH.
Ingestion-tooling preference¶
Standard order: Lakeflow Connect → Databricks SDK → dlt. Autoloader-style ingestion from the Firehose-to-S3 prefix (AWS WAF) fits the Lakeflow Connect / SDK envelope; no CLI-artefact deviation is needed. The validation suite verifies pagination and rate-limit absence under the N/A markings rather than asserting a tool-choice fact directly.
Quirks¶
- Finding-shape on silver.findings. REQ-TRF-MAP asserts the connector targets silver.findings with the canonical envelope columns populated; cwe_id / cve_id / repository_id / file_path / start_line are null. The previous silver.waf_events schema deviation has been collapsed. The test fails if a dedicated silver.waf_events table is targeted.
- Severity is derived. REQ-TRF-SEV asserts the lookup is action-keyed, not severity-keyed. A severity-keyed lookup that mirrors a source severity field is a FAIL.
- Status is the literal open. REQ-TRF-STS (degraded form) asserts status_canonical is the literal open on every emitted row and never transitions, per the trufflehog convention. The config/status/{source}.yml lookup contains a comment to that effect.
- Deterministic finding_id. REQ-DQ asserts finding_id is a SHA-256 hash of (webacl_arn, request_id, timestamp_ms) and that re-delivered events collapse onto the same finding_id at the Bronze→Silver MERGE.
- WAF-only telemetry is dropped. REQ-TRF-MAP asserts source_ip, country, http_method, response_code, sampling_weight, rule_type, and the action value itself are NOT projected onto silver.findings. Operators query upstream WAF logs (S3 / CloudWatch) for that telemetry.
- Sampling weight preserved in Bronze only. Where the SDK sampled-request fallback is in use, REQ-TRF-MAP asserts the Weight field is projected into Bronze (not onto silver.findings). Downstream extrapolation depends on it.
- Application linkage is deferred. REQ-TRF-MAP asserts repository_id is null on every row and that no transform-time join against silver.deployments is emitted. Gold-side aggregations bucket WAF findings under the __UNMAPPED__ application sentinel until an operator extends silver.app_repo_mapping with a webacl_arn → application_id mapping (out of scope for the MVP).
- Action vocabulary. Documented actions include block, allow, count, challenge, captcha. REQ-TRF-SEV asserts coverage over the full action vocabulary the source emits.
- Log-stream over SDK. REQ-ING-PAG and REQ-ING-RL are bound only when the SDK fallback is in use. Log-stream-only deployments mark them N/A with the rationale "log-stream mode has no API pagination/rate limit".
Rendered from .claude/skills/validate-implementation/references/waf.md. Source of truth lives in the skill file.