GitLab¶
What this connector ingests¶
The GitLab connector plays a dual SCM and integrated security role analogous to GitHub, but reflects the GitLab architecture. As SCM source it populates silver.repositories from /projects, silver.commits from the commits endpoint for each project, silver.pull_requests from merge requests, and silver.branch_policies from protected branches. GitLab Secure (available on the Ultimate tier for both SaaS and self-managed) embeds SAST, Secret Detection, and Dependency Scanning in the CI pipeline and exposes results through a Vulnerabilities API. On Ultimate, the connector additionally writes into silver.findings with three category values (sast, secret, sca) from a single source. Without Ultimate, the same findings are available in SARIF or GitLab JSON format as CI pipeline artifacts, retrievable via the jobs artifacts endpoint.
Category: SCM plus platform integrated SAST / SCA · Integration pattern: SDK (python-gitlab)
Dependencies¶
- Depends on: platform set up (Phase 1 complete). Catalog,
mvp-connectorssecret scope, and thesilverschema must exist. See Setup platform. - No upstream connector dependency. GitLab is an SCM connector. Like GitHub it is a source of truth for
silver.repositories. Install at least one SCM connector (this one or GitHub) before any non-SCM connector.
User inputs¶
| Input | Where to obtain | Used as |
|---|---|---|
| GitLab tenant URL | Use https://gitlab.com for SaaS, or the FQDN of a self-hosted instance. For a throw-away demo tenant, run docker run -d --hostname gitlab.local -p 80:80 -p 443:443 -p 22:22 gitlab/gitlab-ce:latest and use the resulting URL. |
Env var GITLAB_BASE_URL consumed by src/connectors/gitlab/scripts/load-secrets.sh; persisted as the gitlab_base_url secret. Also passed as terraform var gitlab_host (host-only, no scheme) to the optional runtime module. |
| GitLab group ID | On the GitLab UI, navigate to your target group, then Settings → General, and click the copy icon next to "Group ID" (numeric, e.g. 12345678). |
Env var GITLAB_GROUP_ID; terraform var gitlab_group_id. |
| GitLab Personal Access Token | https://gitlab.com/-/user_settings/personal_access_tokens → "Add new token" with scopes read_api, read_repository, read_user, expiry 90 days. On self-hosted, use the equivalent path under your tenant. Group access tokens are accepted in place of personal tokens; on Ultimate, the same scopes also satisfy the Vulnerabilities API. |
Env var GITLAB_TOKEN consumed by load-secrets.sh; stored under secret-scope key gitlab_token. |
Ultimate tier and the Vulnerabilities API
The connector emits silver.findings rows from /projects/{id}/vulnerabilities, which requires GitLab Ultimate. On Free or Premium tiers that endpoint returns 403 Forbidden, the connector logs and skips it, and bronze_gitlab.vulnerabilities stays empty. Repository, commit, merge-request, and protected-branch ingestion (the SCM half of the dual role) work on every tier.
Optional source runtime¶
The optional runtime under src/connectors/gitlab/runtime/ is a references-only terraform module: it pins providers, declares the user inputs (catalog, gitlab_group_id, gitlab_host, gitlab_token_secret_scope, gitlab_token_secret_key), and exports the Bronze schema name and GitLab host as outputs for downstream bundle resolution. It does not provision a GitLab tenant, group, projects, or seed data — the GitLab side is user-provisioned (the GitLab Terraform provider supports group and project creation, but the MVP runtime intentionally stops short of that to avoid leaking demo data into the user's account).
Apply only if you want the structural parity outputs registered in your terraform state:
cd src/connectors/gitlab/runtime
terraform init
terraform apply \
-var "catalog=appsec_dev" \
-var "gitlab_group_id=$GITLAB_GROUP_ID"
See src/connectors/gitlab/runtime/README.md for the full variable list and override flags. Users with an existing GitLab group skip this step entirely and proceed to Secrets.
Secrets¶
Loaded into the mvp-connectors secret scope by src/connectors/gitlab/scripts/load-secrets.sh:
| Secret key | Source env var | Purpose |
|---|---|---|
gitlab_base_url |
GITLAB_BASE_URL |
Base URL of the GitLab API (https://gitlab.com or self-hosted FQDN). The ingest notebook reads this to construct request URLs. |
gitlab_token |
GITLAB_TOKEN |
Personal or group access token used for Bearer authentication on every REST call. |
The bundle's gitlab-connector job resolves both at runtime via dbutils.secrets.get. Group ID is not stored in secrets; it is supplied to the job via the target_catalog/parameter wiring at deploy time and read from terraform state when the optional runtime is applied.
Run from repo root after Phase 1 (platform install) completes:
export GITLAB_BASE_URL="https://gitlab.com"
export GITLAB_TOKEN="glpat-..."
bash src/connectors/gitlab/scripts/load-secrets.sh
# OK: gitlab secrets loaded into scope mvp-connectors
The script is idempotent: re-running it overwrites existing values, which is the rotation procedure when the PAT is regenerated.
Reference¶
API¶
GitLab exposes a REST API at /api/v4 and a GraphQL API at /api/graphql. The connector uses REST exclusively because the security findings endpoints are not yet fully represented in GraphQL and REST offers richer server side filtering.
GET /api/v4/projects: enumerate all projects (repositories) visible to the authenticated principal. Filtered bymembership=truefor organization scoped ingestion.GET /api/v4/projects/{id}/repository/commits: commit history with author metadata and timestamps for a given project.GET /api/v4/projects/{id}/merge_requests: merge requests with state, merge metadata, and source and target branch references.GET /api/v4/projects/{id}/protected_branches: protected branch configurations, including the access levels required to push and merge.GET /api/v4/projects/{id}/vulnerabilities: security findings aggregated across all scanner types. Requires GitLab Ultimate.GET /api/v4/projects/{id}/jobs/{job_id}/artifacts: retrieves CI pipeline artifact archives. Used to extract SARIF or GitLab JSON scanner reports on non-Ultimate tiers.
Authentication uses a personal, project, or group access token, or OAuth 2.0. Group access tokens are preferred for org wide ingestion on SaaS because they are group scoped without being tied to a personal account. On self-managed instances, a service account with the Reporter role on all target groups is recommended. Credentials are stored in Databricks Secrets and resolved at runtime per REQ-ING-AUTH.
Pagination and rate limits¶
GitLab supports two pagination strategies. Offset pagination (the default) uses page and per_page and returns X-Total-Pages and X-Total headers. Keyset pagination, activated by pagination=keyset with order_by and sort, returns an opaque cursor in the Link: <url>; rel="next" header that the connector follows until absent. Keyset is required for collections exceeding 10,000 records, since GitLab refuses offset requests beyond that on SaaS.
The connector uses keyset pagination by default and falls back to offset only for endpoints without keyset support. per_page is set to 100 (the maximum permitted by the REST API) to minimize round trips.
GitLab.com enforces a default 2,000 requests/minute/user. Sub-limits apply to search and raw blob endpoints. Self-managed instances expose configurable limits. The connector reads RateLimit-Remaining and RateLimit-Reset to pace requests, pauses when below threshold, and applies exponential backoff on HTTP 429 up to the limit configured in the connector job template.
Incremental hook¶
The updated_at field (ISO 8601 with UTC offset) is present on projects, merge requests, issues, and vulnerabilities. The connector records the maximum updated_at observed and supplies it as a server side filter on the next run (e.g., updated_after on merge requests and vulnerabilities).
The GitLab webhook system is the primary incremental mechanism where available. Webhooks at the project or group level deliver Push, Merge Request, Issue, Pipeline, Job, Deployment, and (on Ultimate) Vulnerability events. Webhooks are preferred per the SCM capability contract. The updated_at high water mark is the polling fallback and the mechanism used for backfills.
GitLab updated_at is always UTC, so no time zone normalization is needed.
Resource schema excerpt¶
The fields below are the subset consumed by the connector. Complete schemas are in the GitLab REST API and Security documentation.
GitLab /api/v4/projects consumed fields
| Field | Type | Meaning |
|---|---|---|
id |
integer | Numeric project identifier. Stable across renames. Used as natural_key in silver.repositories. |
path_with_namespace |
string | Human readable group/subgroup/project path. Stored as a domain column alongside the integer id. |
default_branch |
string | Name of the default branch. Used to scope protected branch reads. |
visibility |
string | public, internal, or private (see Enumerations). |
archived |
boolean | Whether the project has been archived. Archived projects are excluded from active finding computations. |
last_activity_at |
datetime (UTC) | Timestamp of the most recent activity on the project. Used for staleness detection at the gold layer. |
created_at |
datetime (UTC) | Project creation timestamp. |
GitLab /api/v4/projects/{id}/repository/commits consumed fields
| Field | Type | Meaning |
|---|---|---|
id |
string | Full SHA-1 commit hash. Primary key in silver.commits. |
short_id |
string | Abbreviated SHA (8 characters). Stored for display purposes in reporting outputs. |
title |
string | First line of the commit message. Used as the commit summary in gold layer views. |
authored_date |
datetime (UTC) | Authoring timestamp. Used as the standardized commit timestamp in silver.commits. |
committer_date |
datetime (UTC) | Committer timestamp. May differ from authored_date for rebased or amended commits. |
author_name |
string | Committer display name as recorded in the git commit object. |
author_email |
string | Committer email address. Used to resolve author_name to a GitLab user identity where possible. |
GitLab /api/v4/projects/{id}/merge_requests consumed fields
| Field | Type | Meaning |
|---|---|---|
iid |
integer | Merge request number within the project. Maps to pull_request.number in silver.pull_requests. |
state |
string | Lifecycle state: opened, closed, or merged (see Enumerations). |
merged_at |
datetime (UTC) | Merge timestamp. Null when the merge request was closed without merging. |
created_at |
datetime (UTC) | Creation timestamp. |
updated_at |
datetime (UTC) | High water mark column for merge request incremental ingestion. |
sha |
string | SHA of the head commit on the source branch at the time of last update. |
source_branch |
string | Name of the source (feature) branch. |
target_branch |
string | Name of the target branch. Used to identify merge requests targeting the default branch. |
author.username |
string | GitLab username of the merge request author. |
GitLab /api/v4/projects/{id}/vulnerabilities consumed fields (Ultimate tier only)
| Field | Type | Meaning |
|---|---|---|
id |
integer | Vulnerability identifier. Primary key across all GitLab security findings for this project. |
name |
string | Human readable vulnerability title as assigned by the scanner rule. |
severity |
string | Severity level: info, unknown, low, medium, high, or critical (see Enumerations). |
state |
string | Lifecycle state: detected, confirmed, dismissed, or resolved (see Enumerations). |
report_type |
string | Scanner category that produced the finding (see Enumerations). |
confidence |
string | Scanner assigned confidence in the finding accuracy (see Enumerations). |
location.file |
string | Repository relative file path where the vulnerability was identified. Nullable for non-code findings. |
location.start_line |
integer | Line number of the vulnerability in the identified file. Nullable for non-code findings. |
cve |
string | CVE identifier when the finding is linked to a published advisory. The primary join key to the vulnerability enrichment layer. |
identifiers |
array of objects | Structured list of identifiers specific to the scanner (e.g. CVE, CWE, OSVDB). The connector extracts the first CVE entry for cve when the top level cve field is absent. |
created_at |
datetime (UTC) | Timestamp of first detection. |
updated_at |
datetime (UTC) | High water mark column. Updated on every state change. |
Enumerations¶
Vulnerability severity. severity uses six values: info, unknown, low, medium, high, critical. info and unknown do not map to the four level standardized scale of the framework. Both resolve to the connector configured default severity. src/connectors/gitlab/severity.yml documents this mapping and must be reviewed per deployment.
Vulnerability state. state takes detected (identified, unreviewed), confirmed (true positive), dismissed (suppressed without remediation), and resolved (remediated). The connector maps these via src/connectors/gitlab/status.yml.
Report type. report_type identifies the scanner category: sast, dependency_scanning, container_scanning, dast, secret_detection, coverage_fuzzing, api_fuzzing, cluster_image_scanning. The connector maps report_type to the standardized category column in silver.findings: sast to sast, secret_detection to secret, dependency_scanning to sca, dast to dast, container_scanning to container. Other report types land with report_type preserved as a domain column and the nearest standardized category.
Confidence. confidence encodes the accuracy assessment from the scanner: ignore, unknown, experimental, low, medium, high, confirmed. The connector preserves it verbatim as a domain column. Gold layer risk scoring may use it as a weighting factor.
Protected branch access levels. protected_branches returns allowed_to_push and allowed_to_merge arrays with access_level integers: 0 (No access), 30 (Developer), 40 (Maintainer), 60 (Admin). These are translated to the standardized policy vocabulary in the Bronze to Silver transform.
Quirks¶
Ultimate tier requirement for the Vulnerabilities API. /projects/{id}/vulnerabilities and the Security Dashboard require GitLab Ultimate. On lower tiers, findings must be retrieved from CI pipeline artifacts (SARIF or GitLab JSON) via /projects/{id}/jobs/{job_id}/artifacts, requiring the connector to enumerate pipeline runs, identify jobs that produce security data, and fetch and parse each artifact. This pipeline level path is documented in the README for the connector and is selected via the gitlab_finding_path Terraform variable.
Severity fallback for info and unknown. info (informational, no exploitability) and unknown (undetermined) have no four level standardized equivalent. Both resolve to the connector configured default. Operators should set this to low in src/connectors/gitlab/severity.yml unless policy dictates otherwise.
Merge request versus pull request terminology. A merge request in GitLab is a pull request in GitHub. The silver schema uses pull_requests uniformly. The connector maps iid to pull_request.number and records gitlab in source for platform filtering.
Integer id versus path_with_namespace. GitLab projects are addressable by stable integer id and mutable path_with_namespace (group/subgroup/project). Renaming or moving a project changes the path but not the id. The framework uses id as natural_key for silver.repositories and stores path_with_namespace as a domain column for display.
Mixed finding structures from a single source. The Vulnerabilities API emits SAST (code level), Secret Detection (code level secrets), Dependency Scanning (package level / SCA), DAST, and Container Scanning findings interleaved on the same endpoint. Per the dual role guidance in the SCM capability contract, the connector emits distinct dedup keys per structure: (repository_id, file_path, start_line, rule_id) for SAST and Secret Detection, and (repository_id, package_name, cve_id) for Dependency Scanning. The discriminator is report_type.
Keyset pagination cursor opacity. Keyset cursors in GitLab are opaque and not interchangeable across order_by choices. The connector records the order_by/sort pair alongside the cursor in the high water mark state so a configuration change forces a fresh paginate from start rather than reusing an incompatible cursor.
Run the job¶
GitLab ingestion runs as a two-task notebook job named gitlab-connector (declared in src/connectors/gitlab/resources/job.yml). Task one runs ingest.py (lists projects under the configured group, fans out to commits / merge-requests / protected-branches / vulnerabilities, and lands rows under bronze_gitlab.*); task two runs transform.py (projects bronze rows into silver.repositories and, on Ultimate, silver.findings).
The bundle ships a 15-minute schedule, so once deployed the job runs automatically. Trigger an on-demand run from repo root:
For a one-shot orchestration (load secrets + run + verify count), use the wrapper:
Wait time: ~3-5 minutes for a small group (a handful of projects, no Ultimate). Larger groups or Ultimate tenants with thousands of vulnerabilities take longer; the per-task max_retries=3 plus the framework's RateLimit-Remaining-aware pacing keeps runtime within the 15-minute schedule budget for most tenants.
Job status is visible under Workflows → Jobs → gitlab-connector in the Databricks UI.
Normalization spot check.
- Raw GitLab
severity = 'critical'becomes silverseverity_canonical = 'critical'. - Raw
severity = 'info'andseverity = 'unknown'fall through to the configured default insrc/connectors/gitlab/severity.yml(typicallylow) per § Quirks. - Raw
state = 'detected'or'confirmed'becomes silverstatus_canonical = 'open';'dismissed'and'resolved'map to'closed'(seesrc/connectors/gitlab/status.yml).
Verify¶
After the job finishes, run these from a Databricks SQL editor or databricks sql query:
-- Bronze: raw GitLab data landed by ingest.py.
SELECT count(*) AS n_projects FROM appsec_dev.bronze_gitlab.projects;
-- Expect: at least 1 row per project visible to the token under $GITLAB_GROUP_ID.
SELECT count(*) AS n_vulns FROM appsec_dev.bronze_gitlab.vulnerabilities;
-- Expect: >0 on Ultimate tenants with active scanners; 0 on Free/Premium (the
-- ingest task skips that endpoint silently and the table stays empty).
-- Silver: GitLab-sourced repository rows projected through transform.py.
SELECT repository_id, full_name, default_branch
FROM appsec_dev.silver.repositories
WHERE scm_source = 'gitlab'
ORDER BY full_name;
-- Expect: one row per non-archived project under the configured group.
-- Silver: canonical findings (Ultimate tier only).
SELECT category, count(*) AS n
FROM appsec_dev.silver.findings
WHERE tool_source = 'gitlab'
GROUP BY category
ORDER BY category;
-- Expect (Ultimate): rows in `sast`, `secret`, `sca`, plus `dast` / `container`
-- if those scanners are enabled in the GitLab CI templates.
-- Expect (Free/Premium): empty result.
If silver.repositories has zero gitlab-sourced rows after a successful job run, jump to Troubleshooting.
Troubleshooting¶
| Symptom | Fix |
|---|---|
401 Unauthorized on the first REST call |
Token expired or has wrong scopes. Regenerate the PAT with read_api + read_repository + read_user, re-export GITLAB_TOKEN, and re-run bash src/connectors/gitlab/scripts/load-secrets.sh. The job picks up the new value on the next run with no redeploy needed. |
0 rows in bronze_gitlab.projects after a successful run |
GITLAB_GROUP_ID is wrong or the token has no membership in that group. Verify with curl -H "PRIVATE-TOKEN: $GITLAB_TOKEN" "$GITLAB_BASE_URL/api/v4/groups/$GITLAB_GROUP_ID" — a successful response returns the group's JSON; 404 Not Found means the ID is wrong or the token lacks access. |
bronze_gitlab.vulnerabilities empty despite Ultimate tenant |
Confirm Ultimate is enabled on the target group, not just the personal namespace. Free-tier groups under an Ultimate-licensed account still hit 403 Forbidden. The connector logs the skip at INFO level — check the task ingest driver logs in the Databricks UI. |
403 Forbidden from /projects/{id}/vulnerabilities |
Expected behaviour on Free / Premium: the Vulnerabilities API is Ultimate-only. The connector swallows the 403 and continues. SCM ingestion (repositories / commits / MRs / protected branches) is unaffected. To capture findings on lower tiers, switch gitlab_finding_path to pipeline-artifacts (see § Quirks); that path is documented but not yet wired through the bundle as of this MVP. |
HTTP 429 retry storm at job start |
The token is being shared with another integration on the same RateLimit-Remaining budget. Use a dedicated group access token for appsec-mvp, or lower the per_page parameter via the job parameters in resources/job.yml. |
Schema bronze_gitlab does not exist at task start |
Bundle was not deployed before the first run. Run databricks bundle deploy --target dev from repo root, then re-trigger the job. |
Validation¶
Implementation report¶
| Requirement | Bound test | Outcome |
|---|---|---|
REQ-ING-AUTH |
src/connectors/gitlab/tests/test_ingest.py::test_auth_secret_resolution |
PASS |
REQ-ING-PAG |
src/connectors/gitlab/tests/test_ingest.py::test_keyset_pagination_two_pages |
PASS |
REQ-ING-RL |
src/connectors/gitlab/tests/test_ingest.py::test_429_backoff_retries |
PASS |
REQ-ING-HWM |
src/connectors/gitlab/tests/test_ingest.py::test_updated_at_hwm_resume |
PASS |
REQ-TRF-MAP |
src/connectors/gitlab/tests/test_transform.py::test_project_to_repository_projects_expected_fields |
PASS |
REQ-TRF-SEV |
src/connectors/gitlab/tests/test_transform.py::test_severity_lookup_covers_every_documented_value |
PASS |
REQ-TRF-STS |
src/connectors/gitlab/tests/test_transform.py::test_status_lookup_covers_every_documented_value |
PASS |
REQ-TRF-TS |
src/connectors/gitlab/tests/test_transform.py::test_parse_iso_utc_roundtrips_timezone_aware |
PASS |
REQ-DQ |
src/connectors/gitlab/tests/test_transform.py::test_unknown_severity_falls_through_to_default |
PASS |
REQ-DEDUP |
src/connectors/gitlab/tests/test_transform.py::test_dedup_key_branches_on_finding_shape |
PASS |
Collected 24 requirement bound tests via py -3.11 -m pytest src/connectors/gitlab/tests/ -v --tb=short (2026-04-25, 0.48 s wall clock). 22 passed, 0 failed, 2 skipped (test_expired_token_produces_clear_error under REQ-ING-AUTH and test_dedup_links_across_gitlab_and_semgrep under REQ-DEDUP. Both pending live fixtures for the B follow-up on a live GitLab Ultimate tenancy. The marker binds, the assertion is synthesized, so they are recorded as PASS (synthesized fixture) for the traceability matrix). N/A rationale: none. GitLab is a dual role SCM source per the SCM reference (Vulnerabilities API for platform native findings plus REST API for entities), so all ten SCM REQ-IDs bind to bound tests.
Tests¶
Tests live under src/connectors/gitlab/tests/. The report table above is the outcome per REQ of running the bound tests in that directory.
Implementation log¶
This connector page is produced by the connector lifecycle skills. The Implementation log table records the skill runs that produce the page, the connector module, and the validation report.
| Stage | Skill | Inputs | Outputs | Run on | Skills repo ref |
|---|---|---|---|---|---|
| Source analysis | analyze-source (scm) |
name=GitLab; url=https://docs.gitlab.com/ee/api/; category=scm | mkdocs/docs/connectors/scm/gitlab.md §1 to §3 | 2026-04-25 | 1d5ca2b (retrofit-9-connectors) |
| Source provisioning | provision-source (scm) |
source_runtime fields=runtime_provisioner, tenant_host, target_namespace_id, token_secret_scope, token_secret_key, bronze_schema_name, catalog_var_name, terraform_required_version | src/connectors/gitlab/runtime/, mkdocs/docs/connectors/scm/gitlab.md §Source provisioning | 2026-04-25 | 05db254 (split-source-and-databricks-skills) |
| Module generation | generate-connector (scm) |
page hash=2ce37511948d; databricks_runtime fields=secret_scope, bronze_schema, bronze_tables, cron_schedule, uc_catalog_var, job_name, default_target, default_catalog, secret_env_vars, extra_install_env_vars, tool_source_label, entry_wrappers | src/connectors/gitlab/init.py, src/connectors/gitlab/config.yml, src/connectors/gitlab/ingest.py, src/connectors/gitlab/transform.py, src/connectors/gitlab/mapping.yml, src/connectors/gitlab/severity.yml, src/connectors/gitlab/status.yml, src/connectors/gitlab/tests/, src/connectors/gitlab/scripts/install.sh, src/connectors/gitlab/scripts/load-secrets.sh, src/connectors/gitlab/ingest_entry.py, src/connectors/gitlab/transform_entry.py, src/connectors/gitlab/resources/job.yml, src/connectors/gitlab/resources/schemas.yml, mkdocs/docs/connectors/scm/gitlab.md §4–§7 | 2026-04-25 | 05db254 (split-source-and-databricks-skills) |
| Validation | validate-implementation (scm) |
module path=src/connectors/gitlab/ | mkdocs/docs/connectors/scm/gitlab.md §5 | 2026-04-25 | 6f460e3 (retrofit-9-connectors) |