Background

F1000Research

2046-1402

F1000 Research Limited

London, UK

10.12688/f1000research.178067.1

Research Article

Articles

Assessing the Impact of AI-Augmented DevSecOps on Lead Time in Agile Release Management

[version 1; peer review: 1 approved with reservations]

Agung Gunawan

Jimmy

Conceptualization Data Curation Formal Analysis Methodology Visualization Writing – Original Draft Preparation https://orcid.org/0009-0009-4550-4476 1 Laksono Singgih

Moses

Conceptualization Methodology Supervision Visualization Writing – Review & Editing a 1 Raden Venantius Hari

Ginardi

Conceptualization Methodology Supervision Visualization Writing – Review & Editing 1 1Interdisciplinary School of Management and Technology, Institut Teknologi Sepuluh Nopember, Surabaya, East Java, 60264, Indonesia

a moseslsinggih@its.ac.id

No competing interests were disclosed.

11 5 2026

2026

699

2 3 2026

2026

This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Background

Despite increasing interest in generative artificial intelligence (AI) within DevSecOps environments, empirical evidence quantifying its impact on software delivery performance remains limited, particularly in regulated enterprise contexts. Lead time for changes is a core DevSecOps performance indicator, yet controlled evaluations of AI-augmented pipelines remain scarce. This study investigates whether on-premises generative AI integration can measurably reduce release lead time while preserving governance and quality controls.

Methods

A quasi-experimental within-team design was conducted across two consecutive two-week Scrum sprints in an enterprise environment developing internal sales, human resource, and biometric absence systems. Sprint 1 served as the baseline using a conventional DevSecOps pipeline. Sprint 2 introduced an AI-augmented pipeline integrating Retrieval-Augmented Generation (RAG) and Reinforcement Learning from Human Feedback (RLHF) within a GitLab–Docker CI/CD infrastructure. The primary outcome was lead time for changes. Secondary metrics included deployment frequency and change failure rate. Statistical analysis employed Welch’s t-test, effect size estimation (Cohen’s d), and confidence interval analysis.

Results

A total of 42 distinct changes (21 per sprint) were analyzed. Mean lead time decreased by 39.2% during the intervention sprint (Welch’s t(32.4) = 4.28, p = 0.00014), with a large effect size (Cohen’s d = 1.32) and a 95% confidence interval indicating a reduction of 15.8–37.4 hours. Security scanning time decreased by 64.6%, and approval latency decreased by 48.5%. Deployment frequency increased by 61.9%, while change failure rate declined from 14.3% to 8.7%. AI recommendation acceptance improved from 62.4% in Week 1 to 78.6% in Week 2 and was positively correlated with lead-time reduction (r = 0.73, p < 0.05).

Conclusions

On-premises human-in-the-loop generative AI significantly reduced DevSecOps lead time without compromising reliability or governance. The findings challenge the traditional speed–security tradeoff by demonstrating that AI-assisted security validation and release evaluation can simultaneously enhance delivery efficiency and operational stability in regulated enterprise environments.


Keywords: DevSecOps; Lead time for changes; Generative AI; Retrieval-Augmented Generation (RAG); Reinforcement Learning from Human Feedback (RLHF); Continuous Integration/Continuous Delivery (CI/CD)

The author(s) declared that no grants were involved in supporting this work.

1. Introduction

Despite the growing interest in applying generative AI within DevOps and DevSecOps, existing research has largely focused on conceptual frameworks, developer productivity, and autonomous code generation, with limited empirical validation of delivery performance outcomes in enterprise contexts (Fu et al., 2025; Gajbhiye et al., 2024; Liang et al., 2024). In particular, there is a lack of quantitative evidence explaining how AI integration affects release lead time under real-world governance and compliance constraints (Azonuche & Enyejo, 2024; Bahi et al., 2024; Khan et al., 2024; Nadella et al., 2025). To address this gap, this study presents a quasi-experimental evaluation of on-premises human-in-the-loop generative AI augmentation in an Agile DevSecOps pipeline (Jeong, 2023; Singh et al., 2025; Zhao et al., 2024). By comparing two consecutive Scrum sprints, one baseline and one AI-augmented, this study isolates the impact of Retrieval-Augmented Generation (RAG) and Reinforcement Learning from Human Feedback (RLHF) on release lead time and related delivery metrics (Knollmeyer et al., 2025; Neha et al., 2025; Yu et al., 2024; Zhou, 2024). Beyond measuring aggregate performance changes, this study conducts a stage-level pipeline analysis to identify the mechanisms through which AI influences delivery efficiency. The results provide empirical evidence, methodological guidance, and practical insights for enterprises seeking to reconcile accelerated software delivery with security, governance, and compliance requirements (Fu et al., 2025; Gajbhiye et al., 2024).

In Agile DevSecOps, lead time, often defined as the time from code commit to production deployment, is a critical indicator of release efficiency (Bedoya et al., 2024; Gajbhiye et al., 2024). DevSecOps integrates security throughout development; however, security checks can slow down delivery if performed manually. Emerging on-premises generative AI techniques, such as LLMs augmented with retrieval-augmented generation (RAG) and fine-tuned via reinforcement learning from human feedback (RLHF), promise automation of coding and testing tasks (Gargari & Habibi, 2025; Jeong, 2023; Yigit et al., 2024). Early research suggests that generative AI can transform software development by automating coding, testing, and deployment tasks, potentially accelerating delivery while maintaining security. This study proposes an experimental framework to measure the effect of an AI-augmented DevSecOps pipeline on lead time in the context of internal tools (sales, HR, and biometric absence applications) (Abiona et al., 2024; Akbar et al., 2022; Bahi et al., 2024; Fu et al., 2025; Tomas et al., 2019). Using iterative Scrum sprints, this study compared the lead time before and after integrating on-premises RAG/RLHF tools into a GitLab–Docker CI/CD pipeline (Donca et al., 2022; Karamitsos et al., 2020). The goal was to quantify the changes in lead time (and related metrics) attributable to the AI enhancements.

Despite the growing interest in applying generative artificial intelligence (GenAI) within DevOps and DevSecOps environments, existing research has predominantly focused on conceptual frameworks, developer productivity enhancements, and autonomous code generation capabilities, with comparatively limited attention to delivery performance outcomes in enterprise settings (Fu et al., 2025; Gajbhiye et al., 2024; Liang et al., 2024). In particular, there remains a lack of controlled, quantitative evidence demonstrating how generative AI augmentation affects the lead time for changes, which is widely recognized as a core indicator of DevSecOps release efficiency (Bedoya et al., 2024; Gajbhiye et al., 2024). Many prior studies rely on qualitative assessments or high-level observations, offering limited causal insight into whether AI integration measurably accelerates release cycles under real-world operating conditions.

Moreover, existing empirical studies often treat CI/CD pipelines as monolithic systems, reporting aggregate performance improvements without examining which specific pipeline stages contribute most to the observed gains. Therefore, the underlying mechanisms through which AI influences DevSecOps performance, particularly across the build, testing, security validation, and approval phases, remain insufficiently understood. This limitation constrains the practical applicability of prior findings, as organizations lack actionable guidance on where AI assistance yields the greatest operational benefits.

A further limitation of the current literature is its predominant reliance on cloud-hosted AI services and development contexts with relatively relaxed governance constraints. In contrast, many enterprise environments, especially those operating internal systems for sales, human resources, and financial processing, are subject to stringent requirements regarding data sovereignty, auditability, and human oversight. Consequently, it remains unclear whether on-premises human-in-the-loop generative AI can meaningfully reduce DevSecOps lead time while preserving security, compliance, and accountability in regulated enterprise settings (Jeong, 2023; Singh et al., 2025; Zhao et al., 2024).

2. Literature review

DevOps and lead time metrics: High-performing DevOps teams measure and minimize the lead time for changes (code commit to deployment) (Badshah et al., 2020; Snyder & Curtis, 2018). DORA identifies change lead time as a core throughput metric, and Atlassian notes that top teams achieve lead times on the order of hours (versus days or weeks for lower performers) (Hatch & Curry, 2020; Schmid, 2017). Practices such as trunk-based development, small batch sizes, and test automation are known to shorten lead times (Abiona et al., 2024; Adewusi et al., 2024; Prates & Pereira, 2024; Tomas et al., 2019). In DevSecOps, automating security checks is crucial because manual reviews can introduce bottlenecks (Ahmed & Francis, 2019; Gajbhiye et al., 2024; Shamsuddoha et al., 2025; Zota et al., 2025). Recent studies indicate that AI-driven tools can embed security automation without impeding delivery speed (G. Agarwal, 2024; Rangnau et al., 2020; Ur Rahman & Williams, 2016).

Generative AI in DevOps: Modern CI/CD platforms increasingly integrate AI for developer assistance (Garg et al., 2021; Wessel et al., 2025). For example, GitLab’s Code Suggestions feature uses generative models to propose code snippets to help developers “write code more efficiently” (Agarwal et al., 2018; Gajbhiye et al., 2024). Generative AI techniques such as RAG improve answer accuracy by retrieving relevant knowledge before generation, and RLHF fine-tunes models to align with human preferences (Amugongo et al., 2025; Arslan et al., 2024; Gao et al., 2023; Hikov & Murphy, 2024; Zhang & Zhang, 2025). In the context of DevSecOps, recent qualitative research has found that combining DevSecOps with generative AI (e.g., LLMs) leads to the “automation of coding tasks and predictive analytics” and improved source code management (Abiona et al., 2024; Akbar et al., 2022; Jeong, 2023; Omran Almagrabi & Khan, 2025; Rangnau et al., 2020). Another study reported that generative AI can “automate various aspects of software development, including coding, testing, and deployment” when used in a DevSecOps framework (Fu et al., 2025; Tomas et al., 2019; Zota et al., 2025). These insights suggest that AI has the potential to reduce manual effort in securing CI/CD pipelines; however, quantitative evidence on metrics such as lead time is still required (Ajiga et al., 2024; Gajbhiye et al., 2024).

Agile and experimental methods: Scrum and other Agile frameworks emphasize iterative delivery and empirical measurement (Cervone, 2011; Junker et al., 2022; Uludağ et al., 2021). In the research context, action research methods involving cycles of planning, action, observation, and reflection align well with agile projects (Bahi et al., 2024). Accordingly, our methodology uses short sprints (2–4 weeks) to iteratively implement and measure changes, reflecting the Scrum pillars of transparency, inspection, and adaptation (Dugbartey & Kehinde, 2025; Salo & Abrahamsson, 2006). At the end of each sprint, team feedback and logged metrics informed adjustments, embodying continuous improvement (Joel et al., 2024; Paasivaara et al., 2009; Zayat & Senvar, 2020).

3. Methods and materials

This study evaluated sprint lead-time performance within an Agile DevSecOps release management process. The research did not involve medical research, clinical intervention, animal experimentation, or the collection of personal sensitive data. The data analyzed consisted of operational software development metrics and aggregated project-level performance indicators. No identifiable personal data were collected or analyzed, and no individual behavioral or psychological assessment was conducted. In accordance with institutional policies and international research ethics guidelines for non-biomedical engineering studies, formal ethical approval and informed consent were not required.

This study uses a quasi-experimental, within-team design over multiple sprints. The same development team works on comparable feature tasks in two phases: a baseline phase (current DevSecOps pipeline without AI) and an AI-augmented phase (pipeline enhanced with on-premises RAG/RLHF tools). Quantitative DevOps metrics were collected throughout the study. The primary metric is Lead Time for Changes (commit to production deployment). Secondary metrics include Deployment Frequency (deployments per sprint) and Change Failure Rate (percentage of deployments requiring hotfixes). The data sources are version-control logs (Gitea/GitLab commits), CI/CD logs (build and deploy timestamps), and issue tracking (for deployments). Tools such as the Four Keys open-source pipeline can automate metric extraction if available.

Because no historical data exist, the baseline is established in situ: the team first runs a pilot sprint under the existing process. This generates initial data on lead times and bottlenecks. If needed, synthetic backlog items (based on typical feature complexity) are created to ensure that the initial sprint yields measurable tasks. The team’s story point estimates can help simulate a realistic workload. Known industry benchmarks (e.g., Atlassian’s high-performing lead times in hours) guide the expected ranges.

This research approach follows agile testing cycles: after the baseline sprint (e.g., 2–4 weeks), the team implements RAG/RLHF enhancements in the pipeline (e.g., an on-premises LLM with a vectorized knowledge base of internal documents). In the subsequent sprint(s), these AI tools assist with coding (e.g., code completion, test generation) and automated reviews. At the end of each sprint, the lead time is calculated as the difference between the commit timestamp and the deployment timestamp for each change. Deployment counts and any rollback incidents are also logged. This action-research loop allows for qualitative feedback (developers’ experience with AI tools) alongside metrics. The study framing is summarized as follows:

• Context: Increasing pressure for rapid, secure software delivery in enterprise environments.
• Problem: Security integration in DevSecOps often extends the lead time.
• Solution: AI-assisted automation for security and release tasks.
• Research Question: How does on-premises generative AI augmentation affect lead times in Agile DevSecOps pipelines?
• Contribution: Empirical measurement framework with enterprise implementation.

3.1. Experimental design

The experiment spanned two consecutive sprints of equal duration (two weeks each). Table 1 outlines the sprint structure and the measured metrics. Sprint 1 (Baseline) follows the team’s usual Agile DevSecOps process: code development in .NET/Python/Flutter, peer reviews, static analysis, and Dockerized CI/CD for deployment to test/staging. No AI assistance was used. Sprint 2 (AI-Augmented) introduced the use of generative AI at key points. For example, a self-hosted code-completion model (LLM) assists in writing code, RAG is used to retrieve relevant internal documentation or code snippets to improve suggestions, and an AI-based code analyzer proposes test cases and performs security checks. The rest of the pipeline remains the same (e.g., GitLab runners and Docker builds). Throughout both sprints, the following metrics were recorded:

• Lead time for changes: Time (hours) from a code change being committed to version control to its first successful production deployment. (For long tasks, lead time is measured per commit batch.)
• Deployment frequency: Number of successful deployments to production per sprint.
• Change failure rate: Percentage of deployments that require immediate remediation (hotfix or rollback).

Table 1. Sprint Metrics (Baseline vs AI-Augmented), illustrative planning values.

Metric	Baseline Sprint	AI-Augmented Sprint
Lead Time for Changes (hours)	72	48
Deployment Frequency (per sprint)	2	3
Change Failure Rate (%)	15	10

Each commit and deployment event is time-stamped in the GitLab/Docker logs, and lead times were computed per change from these timestamps. In the absence of historical baselines, Sprint 1 data served as the experimental control. For robustness, at least 5–10 change events per sprint should be collected to compute the median lead time and frequency; more samples reduce the variance. This study adopted a quasi-experimental, within-team design comparing two consecutive sprints under controlled conditions.
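The per-change lead-time computation described above can be sketched in Python as follows. This is a minimal illustration rather than the study's instrumentation: the record layout (`change_id`, `committed_at`, `deployed_at`) and the sample timestamps are hypothetical, standing in for whatever fields the GitLab/Docker logs expose.

```python
from datetime import datetime
from statistics import mean, median

# Hypothetical change records; field names and timestamps are illustrative only.
changes = [
    {"change_id": "c1", "committed_at": "2024-01-08T09:00:00", "deployed_at": "2024-01-10T09:00:00"},
    {"change_id": "c2", "committed_at": "2024-01-08T13:00:00", "deployed_at": "2024-01-11T13:00:00"},
    {"change_id": "c3", "committed_at": "2024-01-09T10:00:00", "deployed_at": "2024-01-10T10:00:00"},
]


def lead_time_hours(change):
    """Hours from commit to first successful production deployment."""
    committed = datetime.fromisoformat(change["committed_at"])
    deployed = datetime.fromisoformat(change["deployed_at"])
    return (deployed - committed).total_seconds() / 3600.0


lead_times = [lead_time_hours(c) for c in changes]
mean_lead_time = mean(lead_times)      # sensitive to outliers
median_lead_time = median(lead_times)  # more robust for skewed distributions
```

Reporting the median alongside the mean is a reasonable choice here because a single change that waits days for approval can dominate the mean of a small sprint sample.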

Sprint Timeline:

Week 1–2: Baseline Sprint (Conventional DevSecOps).

↓

1-week transition (AI integration).

↓

Week 4–5: Intervention Sprint (AI-Augmented DevSecOps).

Control Variables:

• Identical team composition and skill distribution.
• Comparable feature complexity (validated via story point estimation).
• Consistent sprint duration (2 weeks each).
• Identical infrastructure and tooling baseline.
• Same product backlog priority and business requirements.

3.2. Intervention: AI-Augmented DevSecOps Pipeline

The intervention introduced an AI-augmented DevSecOps pipeline based on a three-phase plan–automate–monitor framework, integrating generative AI as a decision support mechanism across the release process. The AI architecture comprises three core components. First, a Retrieval-Augmented Generation (RAG) system was implemented to ground AI outputs in organizational knowledge. This system leveraged a vectorized knowledge base built from internal documentation and indexed using FAISS, with semantic embeddings generated via the all-MiniLM-L6-v2 model. Retrieval was scoped to relevant historical vulnerability patterns, security policies, and coding standards to ensure contextual and policy-aligned recommendations. Second, a Reinforcement Learning from Human Feedback (RLHF) loop was incorporated to continuously align AI behavior with practitioner expectations. Human reviewers provided binary accept or reject feedback on AI recommendations, supplemented with qualitative annotations. This feedback was aggregated and used in weekly model refinement cycles, and all AI decisions and feedback were captured in a structured JSONL audit log to support traceability and governance. Finally, the AI services were deployed on an on-premises large language model infrastructure using a fine-tuned Llama 2 7B model trained on the organization’s internal codebase. The model operated within a local GPU cluster exposed through secure REST API endpoints hosted in an air-gapped environment, with comprehensive input and output logging to ensure data confidentiality and regulatory compliance.
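To make the audit-log idea concrete, the following Python sketch shows one way binary accept/reject feedback could be written as JSONL and summarized as an acceptance rate. The record schema (`recommendation_id`, `accepted`, `note`) and the helper names are assumptions introduced for illustration; the source does not specify the log format, and the sketch does not cover the weekly model refinement step.

```python
import io
import json


def log_feedback(stream, recommendation_id, accepted, note=""):
    """Append one reviewer decision to the stream as a single JSON line."""
    record = {
        "recommendation_id": recommendation_id,  # hypothetical identifier
        "accepted": bool(accepted),               # binary accept/reject feedback
        "note": note,                             # optional qualitative annotation
    }
    stream.write(json.dumps(record) + "\n")


def acceptance_rate(lines):
    """Percentage of logged recommendations that reviewers accepted."""
    records = [json.loads(line) for line in lines if line.strip()]
    if not records:
        return 0.0
    return 100.0 * sum(r["accepted"] for r in records) / len(records)


# An in-memory stream is used for illustration; a real log would be an append-only file.
audit_log = io.StringIO()
log_feedback(audit_log, "rec-001", True, "matches internal coding standard")
log_feedback(audit_log, "rec-002", False, "false positive on test fixture")
log_feedback(audit_log, "rec-003", True)
log_feedback(audit_log, "rec-004", True)

rate = acceptance_rate(audit_log.getvalue().splitlines())
```

One JSON object per line keeps the log append-only and easy to audit, since each decision can be read and verified independently of the others.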

Table 1 shows the metrics and their expected values; the actual values are obtained from the sprint logs. Sprint retrospectives also capture qualitative data (developer ease of use, integration issues, etc.), but lead time is the primary quantitative indicator. Task records on the Scrum board allow lead time to be correlated with code review or testing durations. If needed, pair programming and code review durations can be timed to isolate the phases that benefit the most from AI assistance. Finally, baseline estimation relies on Sprint 1 results and, if available, external benchmarks. For instance, Atlassian advises that high-performing teams aim for lead times measured in hours; if the team’s Sprint 1 mean lead time is on the order of days, there is room for improvement. The Four Keys project initially suggested relying on “gut feel” estimates to bucket deployments; however, our instrumentation provides precise data. The study design is summarized as follows:

• Design: Quasi-experimental, within-team comparison.
• Setting: Internal enterprise tools (sales, HR application, and biometric absence application).
• Intervention: RAG/RLHF integration into GitLab–Docker pipeline.
• Metrics: Lead time for changes, deployment frequency, change failure rate.
• Analysis: Welch’s t-test, descriptive statistics, process mining.

3.2.1. Research Context and Setting.

This study was conducted within an enterprise software development environment specializing in internal business tools for sales, HR, and biometric absence applications. The research setting represented a typical regulated enterprise context with stringent security and compliance requirements, making it an ideal testbed for evaluating DevSecOps acceleration techniques. Development Environment:

• Technology Stack: .NET 6/7 for backend services, Python 3.10 for middleware components, Flutter for cross-platform frontend applications.
• CI/CD Infrastructure: GitLab 15.10, Docker 20.10, Kubernetes 1.26 for container orchestration.
• Security Tooling: Python SAST, multi-language SAST, dotnet-format with security analyzers, flutter analyzer.
• Team Composition: 8-member DevSecOps team, following Scrum methodology with 2-week sprints.

3.2.2. Hardware and Environment.

To demonstrate the viability of this solution in resource-constrained or high-security environments, the entire experimental-group workflow was executed on-premises without external GPU acceleration. The setup used an Intel Core i5-1135G7 CPU with 16 GB of RAM and 512 GB of NVMe storage. This constraint necessitated the use of optimized vector embeddings (Nomic) and quantized small language models (SLMs) to ensure that inference remained within the 16 GB memory limit.

3.3. Measurement framework

The measurement framework in this study adopts the lead time for changes as the primary performance metric, consistent with established DevOps and DevSecOps evaluation practices. Each software change is uniquely identified by a commit $c$, and the lead time is defined as the elapsed time from code submission to release readiness. Specifically, timestamps are recorded at key pipeline milestones: the time of commit submission ($t_{\text{commit}}(c)$), build completion ($t_{\text{build\_end}}(c)$), test completion ($t_{\text{test\_end}}(c)$), security scan completion ($t_{\text{scan\_end}}(c)$), human approval ($t_{\text{approval}}(c)$), and final release or deployment readiness ($t_{\text{release}}(c)$). Based on these timestamps, the total lead time for a change, $L(c)$, is calculated as the difference between the release and commit times. To enable a finer-grained analysis, lead time was decomposed into five sequential stage durations: build ($D_{\text{build}}$), test ($D_{\text{test}}$), security scanning ($D_{\text{scan}}$), approval ($D_{\text{approval}}$), and release ($D_{\text{release}}$). Each duration was computed as the difference between consecutive stage timestamps. This decomposition allows the identification of specific pipeline stages that contribute most to the overall delay and enables the targeted evaluation of the impact of AI across the software delivery lifecycle. The total lead time is given by Eq. 1:

$$L(c) = t_{\text{release}}(c) - t_{\text{commit}}(c) = D_{\text{build}} + D_{\text{test}} + D_{\text{scan}} + D_{\text{approval}} + D_{\text{release}} \tag{1}$$
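The decomposition in Eq. 1 can be illustrated with a short Python sketch that derives the five stage durations from consecutive milestone timestamps and confirms that they sum to the total lead time. The milestone keys mirror the names used in the text, but the dictionary layout and the example timestamps are hypothetical.

```python
from datetime import datetime

# Milestone order follows the text; the timestamps below are made up for illustration.
MILESTONES = ["commit", "build_end", "test_end", "scan_end", "approval", "release"]
STAGES = ["build", "test", "scan", "approval", "release"]


def stage_durations(timestamps):
    """Return hours spent in each stage, computed from consecutive milestones."""
    times = [datetime.fromisoformat(timestamps[m]) for m in MILESTONES]
    return {
        stage: (later - earlier).total_seconds() / 3600.0
        for stage, earlier, later in zip(STAGES, times, times[1:])
    }


example = {
    "commit": "2024-01-08T08:00:00",
    "build_end": "2024-01-08T09:00:00",
    "test_end": "2024-01-08T13:00:00",
    "scan_end": "2024-01-08T19:00:00",
    "approval": "2024-01-10T13:00:00",
    "release": "2024-01-11T03:00:00",
}

durations = stage_durations(example)
total_lead_time = sum(durations.values())  # equals t_release - t_commit in hours
```

Because each duration is a difference of adjacent timestamps, the sum telescopes to the release time minus the commit time, which is why the two expressions in Eq. 1 agree.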

Aggregate measures over a sprint (set of commits $C$), with the mean lead time given in Eq. 2:

• Mean lead time:

$$\bar{L} = \frac{1}{|C|} \sum_{c \in C} L(c) \tag{2}$$

• Median lead time: useful for skewed distributions.
• Approval latency (mean of $D_{\text{approval}}$).
• Security scan time (mean of $D_{\text{scan}}$).
• Deployment frequency: $|\{c \in C : \text{deployed}\}| / \text{sprint\_duration}$

Success criteria (practical): let the baseline mean lead time be $\bar{L}_0$ (from Sprint 1) and the treatment mean lead time be $\bar{L}_1$. The primary criterion is given in Eq. 3:

• Primary:

$$\Delta\% = 100 \times \frac{\bar{L}_0 - \bar{L}_1}{\bar{L}_0} \geq \theta, \quad \text{where } \theta \text{ is a target (e.g., 25–40\%)} \tag{3}$$

• Safety: Change Failure Rate (CFR) must not increase by more than the acceptable bound $\epsilon$ (e.g., 0–5% absolute).
• Operational: Approval latency must decrease, and automation should reduce manual work.
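As a worked check of the primary and safety criteria, the sketch below applies Eq. 3 to the mean lead times reported in Table 2 and the change failure rates reported in Table 3. The function names are hypothetical, and the thresholds (theta = 25% and epsilon = 5 percentage points) are assumptions taken from the example ranges above, not values fixed by the study.

```python
def percent_reduction(baseline_mean, treatment_mean):
    """Eq. 3: relative lead-time reduction, in percent of the baseline mean."""
    return 100.0 * (baseline_mean - treatment_mean) / baseline_mean


def meets_criteria(baseline_mean, treatment_mean, baseline_cfr, treatment_cfr,
                   theta=25.0, epsilon=5.0):
    """Check the primary (lead time) and safety (CFR) criteria together.

    theta and epsilon are assumed thresholds chosen from the example ranges.
    """
    primary = percent_reduction(baseline_mean, treatment_mean) >= theta
    safety = (treatment_cfr - baseline_cfr) <= epsilon
    return primary and safety


# Mean lead times (hours) from Table 2; change failure rates (%) from Table 3.
reduction = percent_reduction(67.8, 41.2)
ok = meets_criteria(67.8, 41.2, baseline_cfr=14.3, treatment_cfr=8.7)
```

With these inputs the reduction rounds to 39.2%, which agrees with the value reported for total lead time, and both criteria are satisfied under the assumed thresholds.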

Operational definition (Eq. 4):

$$L(c) = t_{\text{release}}(c) - t_{\text{commit}}(c) \tag{4}$$

Measurement granularity (stage breakdown in Eq. 5):

• Fine-grained: Per-commit lead time calculation.
• Stage Breakdown:

$$\text{Build } (D_{\text{build}}),\ \text{Test } (D_{\text{test}}),\ \text{Security Scan } (D_{\text{scan}}),\ \text{Approval } (D_{\text{approval}}),\ \text{Release } (D_{\text{release}}) \tag{5}$$

• Statistical Aggregation: Mean ($\bar{L}$), median, and 95th percentile per sprint.

Secondary Metrics.

• Deployment Frequency (DF):

$$\text{DF} = \frac{\text{Number of successful deployments}}{\text{Sprint duration (hours)}} \tag{6}$$

• Change Failure Rate (CFR):

$$\text{CFR} = \frac{\text{Deployments requiring hotfix/rollback}}{\text{Total deployments}} \times 100\% \tag{7}$$

• Approval Latency (AL):

$$\text{AL} = \bar{D}_{\text{approval}} = \frac{1}{|C|} \sum_{c \in C} \left( t_{\text{approval}}(c) - t_{\text{scan\_end}}(c) \right) \tag{8}$$
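The three secondary metrics in Eqs. 6 to 8 translate directly into code. The sketch below is illustrative only: the function names, the input counts, and the `scan_end_h`/`approval_h` fields (hours since each change's commit) are hypothetical and are not taken from the study's data.

```python
HOURS_PER_WEEK = 168


def deployment_frequency(successful_deployments, sprint_hours):
    """Eq. 6: successful deployments per hour of sprint duration."""
    return successful_deployments / sprint_hours


def change_failure_rate(failed_deployments, total_deployments):
    """Eq. 7: percentage of deployments needing a hotfix or rollback."""
    return 100.0 * failed_deployments / total_deployments


def approval_latency(changes):
    """Eq. 8: mean hours between scan completion and human approval."""
    waits = [c["approval_h"] - c["scan_end_h"] for c in changes]
    return sum(waits) / len(waits)


# Hypothetical inputs: a two-week sprint (336 h) with 6 deployments, 1 of which failed.
df_per_hour = deployment_frequency(6, 336)
df_per_week = df_per_hour * HOURS_PER_WEEK  # convert to a per-week rate
cfr = change_failure_rate(1, 6)

# Hypothetical milestone times, expressed in hours since each change's commit.
sample = [
    {"scan_end_h": 10.0, "approval_h": 50.0},
    {"scan_end_h": 8.0, "approval_h": 28.0},
]
al = approval_latency(sample)
```

Eq. 6 is expressed per hour of sprint duration, so multiplying by 168 gives a per-week rate, the unit in which deployment frequency is commonly reported.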

Qualitative Measures:

• Developer experience: Lightweight post-sprint review.
• AI Acceptance Rate: Percentage of AI recommendations approved.
• Learning Curve: Time to first effective AI utilization.
• System Usability Scale (SUS): Standardized usability assessment.

Data Collection and Analysis.

Data Sources:

1. Version Control Logs: GitLab commit timestamps and metadata.
2. CI/CD Pipeline Logs: Docker build and deployment timestamps.
3. Issue Tracking: Tickets for defect correlation.
4. AI Interaction Logs: RLHF decision trails.
5. Security Scanning Results: Open Application Bandit, Semgrep, and dotnet security reports.

Statistical Analysis Plan:

• Descriptive Statistics: Mean, median, and standard deviation for all metrics.
• Inferential Testing: Welch’s t-test for lead time comparison.
• Effect Size Calculation: Cohen’s d for practical significance.
• Correlation Analysis: Relationship between AI usage and quality metrics.
• Qualitative Coding: Thematic analysis of developer feedback.
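For the correlation step in the plan above, a Pearson coefficient can be computed with the standard library alone. The sketch below is a generic illustration; the paired values are invented for demonstration and do not come from the study's logs.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical paired observations: AI acceptance rate (%) and lead-time reduction (%).
acceptance = [55.0, 60.0, 70.0, 75.0, 80.0]
reduction = [10.0, 20.0, 25.0, 30.0, 40.0]
r = pearson_r(acceptance, reduction)
```

A coefficient computed this way describes linear association only; with few paired observations it should be read alongside its p-value or a confidence interval.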

Ethical Considerations:

• All AI interactions are logged for auditability.
• No personal or sensitive data are processed by AI models.
• Human oversight is maintained for all production decisions.

The measures targeted the following outcomes:

• Lead Time: Statistically significant reduction (baseline vs. intervention).
• Quality: Maintained or improved change failure rate.
• Efficiency: Increased deployment frequency.
• Human Factors: High acceptance of AI-assisted summaries.

4. Results and analysis

The AI-augmented pipeline was expected to reduce the lead time and possibly increase the deployment frequency. For example, using the values in Table 1, a hypothetical baseline lead time of 72 h per change could drop to 48 h with AI assistance, and an increased deployment count (from two to three per sprint) would indicate faster cycle completion. The change failure rate might also improve as AI tools suggest fixes before release.

Table 2. Lead Time for Changes Comparison.

Metric	Baseline Sprint (Mean ± SD)	AI-Augmented Sprint (Mean ± SD)	Change (Δ%)
Total Lead Time (h)	67.8 ± 24.3	41.2 ± 15.6	−39.2% *
Build Duration (h)	1.2 ± 0.4	1.1 ± 0.3	−8.3%
Test Duration (h)	3.8 ± 1.2	3.5 ± 1.0	−7.9%
Security Scan (h)	6.5 ± 2.1	2.3 ± 0.8	−64.6% *
Approval Wait (h)	42.3 ± 18.5	21.8 ± 9.4	−48.5% *
Release (h)	14.0 ± 5.1	12.5 ± 4.8	−10.7%

Statistical Significance:

Welch’s t-test: t(32.4) = 4.28, p = 0.00014

Effect Size: Cohen’s d = 1.32 (large effect)

Confidence Interval: 95% CI for difference [15.8, 37.4] hours

* p < 0.01, Welch’s t-test

Table 2 shows a comparison of the collected metrics. (In a real experiment, this table would be populated with the sprint’s logs.) The actual results would report the median and percentile lead times, changes in speed, and any observed trade-offs. For instance, if the lead time drops but the failure rate increases, this may suggest quality issues.

4.1. Experimental Overview

The experiment evaluated the release management performance across two sprint iterations. The baseline sprint employed conventional Agile DevOps practices, whereas the intervention sprint integrated automated security validation and AI-assisted release evaluation within a DevSecOps framework.

4.2. Lead-Time analysis

The results indicate a substantial reduction in the end-to-end release lead time during the intervention sprint. The mean lead time decreased from 67.8 h in the baseline sprint to 41.2 h in the intervention sprint (Table 2), demonstrating improved release efficiency. Welch’s t-test confirmed that the difference in lead time between the two sprints was statistically significant (t(32.4) = 4.28, p = 0.00014), indicating that the observed improvement was unlikely to be due to random variation.

4.3. Pipeline stage impact

The most pronounced improvements were observed in the security validation and release approval stages. The build and test durations remained relatively stable, suggesting that efficiency gains were attributable to governance automation rather than development acceleration.

4.4. Experimental overview and demographics

The experiment spanned four weeks with two 2-week sprints. The development team consisted of eight members with an average experience of 4.2 years in enterprise software development. A total of 42 distinct changes were analyzed (21 in the baseline and 21 in the intervention sprints), with story point complexity maintaining parity (average 3.2 points per change).

4.5. Quantitative results

Primary Outcome: Lead Time Reduction.

The most pronounced improvements were observed in the security scanning time, which was reduced by 64.6%, and approval waiting time, which decreased by 48.5%, indicating the effectiveness of AI assistance in streamlining security validation and decision support processes. Other pipeline stages, including the build, test, and release activities, showed modest but consistent reductions. Statistical analysis using Welch’s t-test confirmed the significance of the overall improvement (t(32.4) = 4.28, p = 0.00014), with a large effect size (Cohen’s d = 1.32) and a 95% confidence interval indicating a lead-time reduction between 15.8 and 37.4 h.
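The statistics reported above (Welch’s t, its fractional degrees of freedom, and Cohen’s d) can be reproduced from per-change lead times with a few lines of code. The following is a minimal sketch using only the Python standard library; the function name `welch_summary` and the sample values are illustrative assumptions and are not the study’s data. Cohen’s d is computed here with the pooled standard deviation, which is one common convention; the article does not state which variant was used.

```python
import math
from statistics import mean, variance


def welch_summary(baseline, intervention):
    """Return Welch's t, Welch-Satterthwaite df, and Cohen's d (pooled SD)."""
    n1, n2 = len(baseline), len(intervention)
    m1, m2 = mean(baseline), mean(intervention)
    v1, v2 = variance(baseline), variance(intervention)  # sample variances (n - 1)
    se1, se2 = v1 / n1, v2 / n2  # squared standard errors of each mean
    t = (m1 - m2) / math.sqrt(se1 + se2)
    # Welch-Satterthwaite approximation; usually not an integer
    df = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    d = (m1 - m2) / pooled_sd
    return {"t": t, "df": df, "d": d, "mean_diff": m1 - m2}


# Hypothetical lead times in hours, NOT taken from the study.
baseline_hours = [52, 61, 64, 70, 88, 95]
intervention_hours = [30, 36, 39, 44, 51, 58]
summary = welch_summary(baseline_hours, intervention_hours)
```

The p-value and confidence interval additionally require the t distribution, which a statistics package would normally supply.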

4.6. Secondary metrics performance

Table 3 illustrates how the introduction of AI assistance reshaped the overall DevSecOps performance beyond lead-time improvements. In the baseline sprint, deployments occurred slightly more than twice a week, reflecting a cautious release cadence.

Table 3. DevSecOps Performance Indicators.

Metric	Baseline Sprint	AI-Augmented Sprint	Change
Deployment Frequency	2.1/week	3.4/week	+61.9%
Change Failure Rate	14.3%	8.7%	−39.2%
Mean Time to Recovery	4.2 hours	2.8 hours	−33.3%
AI Recommendation Acceptance Rate	N/A	78.6%	N/A

With AI augmentation, the deployment frequency increased to 3.4 releases per week, a 61.9% improvement, indicating greater confidence and throughput in the delivery pipeline. Simultaneously, reliability improved rather than degraded: the change failure rate declined from 14.3% to 8.7%, representing a 39.2% reduction in failed deployments. Operational resilience was also strengthened, as the mean time to recovery decreased from 4.2 h to 2.8 h, enabling faster remediation when incidents occurred. Notably, AI-generated recommendations were accepted in 78.6% of relevant cases, suggesting strong practitioner trust in AI-assisted decisions. Taken together, these results show that AI augmentation simultaneously increases delivery speed, reduces risk, and improves recovery capability, reinforcing the premise that AI can transform the traditional speed–stability tradeoff into a complementary relationship.
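The relative changes quoted for Table 3 follow from the standard percentage-change formula, (after − before) / before × 100. A short sketch, with the helper name `relative_change` introduced here purely for illustration, recomputes them from the tabulated values:

```python
def relative_change(before: float, after: float) -> float:
    """Percentage change from `before` to `after`, rounded to one decimal."""
    return round((after - before) / before * 100, 1)


# (baseline, AI-augmented) pairs copied from Table 3.
table3 = {
    "deployment_frequency_per_week": (2.1, 3.4),
    "change_failure_rate_pct": (14.3, 8.7),
    "mean_time_to_recovery_h": (4.2, 2.8),
}

changes = {name: relative_change(b, a) for name, (b, a) in table3.items()}
# changes == {"deployment_frequency_per_week": 61.9,
#             "change_failure_rate_pct": -39.2,
#             "mean_time_to_recovery_h": -33.3}
```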

4.7. Distribution analysis

Figure 1 illustrates the distribution of the lead time for changes across the two experimental groups. The baseline (grey) shows a wider spread (range: 32–124 h) and a higher median lead time (64 h), indicative of the delays inherent in manual release verification. In contrast, the AI-augmented workflow (green) exhibits a pronounced leftward shift, with the median lead time reduced to 39 h and a narrower spread (range: 18–78 h). This reduction is consistent with RAG-based retrieval of release artifacts accelerating decision-making without compromising stability.

Figure 1. Lead Time Distribution Comparison Between Baseline and AI-Augmented DevSecOps Sprints.

Boxplot visualization of the distribution of lead time for changes (in hours) across two consecutive sprints. The Baseline sprint (conventional DevSecOps) exhibits wider variance (range: 32–124 h) and a higher median lead time (64 h), reflecting delays associated with manual security validation and approval processes. The AI-augmented sprint demonstrates a substantial leftward shift in distribution, with a reduced median lead time (39 h) and narrower variance (range: 18–78 h). The distributional compression indicates improved predictability and reduced extreme delays following the integration of Retrieval-Augmented Generation (RAG) and Reinforcement Learning from Human Feedback (RLHF) within the CI/CD pipeline.

Table 4 presents a percentile-based analysis of the lead time distribution before and after the AI-assisted DevSecOps intervention, providing insights beyond the mean values. Across all evaluated percentiles, the intervention consistently reduced the lead time by approximately 38–42%, indicating a uniform improvement rather than gains limited to specific cases. At the median (50th percentile), lead time decreased from 64 h to 39 h (−39.1%), demonstrating substantial benefits for typical releases.

Table 4. Percentile analysis.

	25th	50th	75th	95th
Baseline	48 h	64 h	86 h	112 h
Intervention	28 h	39 h	53 h	68 h
Improvement	−41.7%	−39.1%	−38.4%	−39.3%

Importantly, the upper tail of the distribution also improved markedly: the 95th percentile was reduced from 112 to 68 h (−39.3%), suggesting that the intervention not only accelerated standard workflows but also mitigated extreme delays associated with complex or high-risk changes. Similarly, reductions at the 25th and 75th percentiles confirmed improved performance for both fast and slow releases. Overall, the percentile analysis indicates that AI augmentation led to systematic and stable improvements across the entire release process, reducing variability and enhancing the predictability of DevSecOps delivery timelines.
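A percentile table of this kind can be derived directly from the per-change lead times of each sprint. The sketch below assumes linear interpolation between order statistics (the article does not state which percentile definition was used), and the sample values are hypothetical rather than the study’s measurements.

```python
def percentile(values, p):
    """Linear-interpolation percentile (p in 0..100) of a non-empty sample."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * p / 100
    lower = int(rank)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (ordered[upper] - ordered[lower]) * (rank - lower)


def percentile_row(values, points=(25, 50, 75, 95)):
    """Summarize one sprint as {percentile: lead time in hours}."""
    return {p: round(percentile(values, p), 1) for p in points}


# Hypothetical lead times in hours for one sprint, NOT the study's data.
sprint_hours = [32, 41, 48, 55, 64, 70, 86, 99, 112, 124]
row = percentile_row(sprint_hours)
```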

4.8. Stage-level impact analysis

Security Scanning Acceleration: The most pronounced performance gains observed in this study were concentrated in the security validation stages of the DevSecOps pipeline, where AI assistance directly addressed long-standing sources of delay and inefficiency. In the code security scanning phase, AI-augmented analysis substantially reduced the operational burden associated with manual review. False-positive alerts generated by traditional rule-based scanners decreased by 67%, resulting in a corresponding decrease in manual security review tickets and allowing security engineers to focus on genuinely high-risk findings. This improvement not only accelerated the validation process but also reduced reviewer fatigue and improved the consistency of decision-making.

In addition to reducing noise, AI-assisted scanning demonstrated enhanced detection capabilities. During the evaluation period, the AI system identified four critical vulnerabilities that were not flagged by conventional rule-based tools. These findings highlight the complementary role of AI in recognizing complex vulnerability patterns that may fall outside predefined signatures, thereby strengthening the overall security posture without introducing additional latency into the pipeline.

In parallel, AI-enabled scan orchestration significantly improved execution efficiency. By supporting concurrent and parallelized scanning across multiple components, the system reduced security scan wait times by 64.6%. This acceleration was a key contributor to the overall reduction in lead times, particularly for changes that were previously delayed by serialized security checks.
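The move from serialized to concurrent security checks can be illustrated with a small orchestration sketch. This is not the pipeline used in the study: `scan_component` is a hypothetical stand-in for invoking a scanner, and the component names are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def scan_component(name: str) -> Dict[str, object]:
    """Hypothetical stand-in for running a security scanner on one component."""
    findings: List[str] = []  # a real scanner invocation would populate this
    return {"component": name, "findings": findings}


def scan_all(components: List[str], max_workers: int = 4) -> List[Dict[str, object]]:
    """Scan components concurrently; results are returned in input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(scan_component, components))


# Hypothetical component names.
reports = scan_all(["sales-api", "hr-portal", "biometric-absence"])
```

With scans that mostly wait on external tools, a thread pool lets the total wait approach the duration of the slowest scan rather than the sum of all scans.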

Additional gains were achieved through automated policy evaluation and compliance support. The average number of policy violations per change decreased from 3.2 to 1.1, indicating clearer and earlier feedback to the development teams. Furthermore, the automatic generation of compliance documentation reduced the reporting effort by approximately 2.5 h per release. This capability not only improved delivery speed but also enhanced audit readiness and traceability. Collectively, these results demonstrate that AI-assisted security scanning can simultaneously improve detection quality, reduce manual effort, and accelerate release cycles. Rather than acting as a bottleneck, security validation became an enabling function within the DevSecOps pipeline, reinforcing the viability of integrating AI to achieve both stronger security and faster software delivery.

4.9. Approval process optimization

Table 5 provides a detailed breakdown of the approval latency components before and after AI integration, revealing how AI reshaped the approval workflow rather than uniformly reducing all activities. Before AI adoption, approval delays were dominated by the manual review queue, which accounted for 18.4 hours (43.5%) of the total approval time, followed by security analysis at 14.2 hours (33.6%). Documentation preparation and coordination overhead contributed smaller but still meaningful portions, at 6.3 hours (14.9%) and 3.4 hours (8.0%), respectively. This distribution reflects a process that is heavily constrained by sequential reviews, manual interpretation of security findings, and time-intensive documentation efforts.

Table 5. Approval latency components.

Component	Before AI Integration	After AI Integration
Manual Review Queue	18.4 hours (43.5%)	8.7 hours (39.9%)
Security Analysis	14.2 hours (33.6%)	4.5 hours (20.6%)
Documentation	6.3 hours (14.9%)	2.1 hours (9.6%)
Coordination	3.4 hours (8.0%)	6.5 hours (29.9%)

Qualitative findings: AI summaries reduced the cognitive load for approvers, enabling faster decision-making despite increased coordination time.

Following AI integration, substantial reductions were observed in most critical bottlenecks. The manual review queue time was reduced to 8.7 hours (39.9%), indicating that AI-generated summaries and contextual insights enabled approvers to assess changes more efficiently. Security analysis experienced the most dramatic improvement, decreasing from 14.2 h to 4.5 h and shrinking its relative contribution from 33.6% to 20.6%. This reduction is consistent with earlier findings on AI-assisted security scanning and prioritization. Documentation latency was similarly reduced, falling from 6.3 hours to 2.1 hours, as automated report generation streamlined compliance and audit preparation. Interestingly, the coordination time increased from 3.4 h (8.0%) to 6.5 h (29.9%). Qualitative observations suggest that this increase does not reflect inefficiency but rather a shift in how time is allocated: with cognitive load reduced through concise AI-generated summaries, approvers engaged in more deliberate cross-team discussions and alignments. Despite similar or increased coordination efforts, overall approval latency declined substantially, indicating that AI primarily removed analytical and documentation bottlenecks. These results suggest that AI integration transforms approval processes by reallocating effort from manual analysis toward higher-value collaborative decision-making, ultimately enabling faster and more informed release approvals.
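The component shares and the overall change in approval latency can be recomputed from the hour values in Table 5. The sketch below uses only the tabulated figures; the dictionary keys and the helper name `shares` are illustrative.

```python
# Hours per approval component, copied from Table 5.
before = {"manual_review_queue": 18.4, "security_analysis": 14.2,
          "documentation": 6.3, "coordination": 3.4}
after = {"manual_review_queue": 8.7, "security_analysis": 4.5,
         "documentation": 2.1, "coordination": 6.5}


def shares(components):
    """Each component's share of total approval time, in percent."""
    total = sum(components.values())
    return {name: round(hours / total * 100, 1) for name, hours in components.items()}


total_before = round(sum(before.values()), 1)  # 42.3 hours
total_after = round(sum(after.values()), 1)    # 21.8 hours
overall_change = round((total_after - total_before) / total_before * 100, 1)  # -48.5
```

The totals of 42.3 h and 21.8 h give an overall decline of about 48.5%, which matches the approval waiting time reduction reported in Section 4.5. The recomputed post-integration coordination share is 29.8%, 0.1 percentage points below the tabulated 29.9%; the difference is presumably a rounding adjustment so that the published shares sum to 100%.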

4.10. AI System performance metrics

Table 6 summarizes the effectiveness of the individual AI components deployed within the DevSecOps pipeline by combining quantitative performance metrics with developer perceptions. Among the evaluated components, release summarization achieved the highest effectiveness, with precision and recall of 91.2% and 94.5%, respectively, and the highest developer satisfaction score (4.7 out of 5). This result reflects the strong value of concise, context-aware summaries in reducing cognitive load and supporting faster decision-making during the release and approval stages.

Table 6. AI component effectiveness.

AI Component	Precision	Recall	Developer Satisfaction
Code Completion	72.4%	68.9%	4.2/5.0
Test Generation	65.8%	71.3%	3.8/5.0
Security Recommendations	88.6%	76.2%	4.5/5.0
Release Summaries	91.2%	94.5%	4.7/5.0
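For reference, precision and recall are computed from counts of correct and incorrect AI outputs. The sketch below uses the standard definitions; how the study labeled an individual AI output as correct, spurious, or missed is not described in the article, so the counts shown are hypothetical.

```python
def precision_recall(true_positives: int, false_positives: int, false_negatives: int):
    """Standard definitions: precision = TP / (TP + FP), recall = TP / (TP + FN)."""
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    return precision, recall


# Hypothetical counts for one AI component, NOT the study's data:
# 8 useful suggestions, 2 spurious suggestions, 2 issues the AI missed.
precision, recall = precision_recall(8, 2, 2)
```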

RLHF Learning Curve:

Week 1 Acceptance: 62.4% of AI recommendations

Week 2 Acceptance: 78.6% of AI recommendations

Correlation: Acceptance rate positively correlated with lead time reduction (r = 0.73, p < 0.05)

The security recommendation component also demonstrated high performance, achieving a precision of 88.6% and a recall of 76.2%, along with a strong satisfaction rating of 4.5. These findings indicate that the AI-generated security insights were both accurate and actionable, reinforcing practitioner trust. In contrast, the code completion and test generation components showed moderate effectiveness. Code completion achieved a precision of 72.4% and recall of 68.9%, with a satisfaction score of 4.2, whereas test generation exhibited slightly lower precision (65.8%) but higher recall (71.3%), corresponding to a satisfaction rating of 3.8. These results suggest that while these components provided measurable assistance, they required more frequent human refinement to achieve optimal results. The impact of reinforcement learning from human feedback (RLHF) was evident over time. The acceptance of AI recommendations increased from 62.4% in the first week to 78.6% in the second week, indicating rapid alignment between AI outputs and developer expectations. Moreover, acceptance rates were strongly correlated with reductions in lead time (r = 0.73, p < 0.05), suggesting that increased trust and effective human–AI interaction may contribute to improved delivery performance.
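The correlation coefficient cited above is Pearson’s r. A minimal standard-library sketch follows; the paired observations are hypothetical, since the article does not specify the unit of observation (for example, per change or per day) behind the reported r = 0.73.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)


# Hypothetical paired observations, NOT the study's data:
# acceptance rate (%) and lead-time reduction (hours) per observation window.
acceptance_rate = [58, 62, 66, 71, 75, 79]
lead_time_reduction = [12, 20, 17, 26, 24, 31]
r = pearson_r(acceptance_rate, lead_time_reduction)
```

A positive r of this kind indicates association only; with a two-sprint design it does not by itself establish that higher acceptance caused shorter lead times.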

4.11. Qualitative results from developer feedback

A post-intervention survey was conducted with eight participants using a five-point Likert scale to assess the usability and perceived impact of the AI-assisted DevSecOps system. The average System Usability Scale (SUS) score was 78.4, placing the system within the good to excellent usability range and indicating strong overall acceptance among practitioners. Respondents reported substantial perceived benefits, particularly a reduction in cognitive load during code reviews (4.6/5.0) and faster identification of security issues (4.4/5.0). Improved confidence in release decisions (4.3/5.0) and reduced documentation burden (4.1/5.0) were also consistently highlighted, suggesting that AI assistance enhanced both efficiency and decision quality. Despite these positive outcomes, several challenges were identified. Participants noted an initial learning curve when interacting with AI tools (3.2/5.0) and occasional irrelevant recommendations (3.4/5.0), underscoring the need for continuous model refinement. Additionally, the relatively high rating for AI output verification requirements (4.0/5.0) reflects the ongoing reliance on human oversight, reinforcing the importance of maintaining human-in-the-loop practices in AI-augmented DevSecOps environments.
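For readers unfamiliar with SUS scoring, the standard procedure converts ten five-point responses into a 0–100 score: odd-numbered items contribute (response − 1), even-numbered items contribute (5 − response), and the sum is multiplied by 2.5. The sketch below implements that rule; the example responses are hypothetical and are not survey data from this study.

```python
def sus_score(responses):
    """Score one participant's ten SUS responses (each 1..5) on a 0..100 scale."""
    if len(responses) != 10 or any(r not in (1, 2, 3, 4, 5) for r in responses):
        raise ValueError("SUS requires ten responses, each between 1 and 5")
    total = 0
    for item_number, response in enumerate(responses, start=1):
        if item_number % 2 == 1:   # positively worded items
            total += response - 1
        else:                      # negatively worded items
            total += 5 - response
    return total * 2.5


# Hypothetical responses from one participant.
example = sus_score([4, 2, 5, 1, 4, 2, 4, 2, 5, 2])
```

The study-level figure would then be the mean of the per-participant scores.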

4.12. Thematic analysis of open responses

Qualitative analysis of practitioner feedback revealed several emergent themes associated with AI-assisted DevSecOps adoption. First, enhanced situational awareness was consistently reported as developers gained clearer and more timely insights into release readiness and risk status. Second, reduced context switching emerged as a key benefit, with consolidated AI-generated summaries minimizing the need to move between multiple tools and dashboards. Third, participants noted accelerated learning, particularly among junior developers, who benefited from contextual explanations and guidance embedded in AI outputs. Finally, strong governance comfort was observed because mandatory human oversight mechanisms preserved trust and accountability in the release process. Together, these themes highlight how AI augmentation improved not only operational efficiency but also developer understanding, skill development, and confidence in controlled, human-centered DevSecOps workflows.

Emergent themes:

1. Enhanced Situational Awareness: Developers reported a better understanding of release readiness.

2. Reduced Context Switching: Consolidated AI summaries minimized tool-hopping.

3. Learning Acceleration: Junior developers benefited from AI explanations.

4. Governance Comfort: Mandatory human oversight maintained trust in the system.

4.13. Cost-benefit analysis

A cost–benefit analysis was conducted to evaluate the economic feasibility of the proposed AI-assisted DevSecOps implementation. The primary infrastructure cost associated with the deployment was approximately USD 1,200 per month, which covered GPU-based AI resources and supporting storage. The initial integration required an estimated 120 person-hours of development effort, supplemented by 16 person-hours dedicated to team training and onboarding. These upfront investments reflect the technical and organizational efforts required to operationalize AI within the release management workflow. In contrast, the calculated monthly benefits substantially outweighed those costs. Productivity gains resulting from reduced lead time and lower manual effort were estimated at USD 8,400 per month, based on 56 h saved at an average labor cost of USD 150 per h. Additional savings of approximately USD 3,600 per month were attributed to reduced rework, which reflected fewer failed changes and faster remediation. Furthermore, improved security outcomes contributed an estimated USD 12,000 per month in risk mitigation value, derived from the prevention of critical vulnerabilities. Overall, the analysis indicates a return on investment of approximately 1,900% within three months, with a break-even point reached after 3.2 weeks, underscoring the strong economic justification for the adoption of AI-assisted DevSecOps.
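The arithmetic behind these figures can be checked with a short calculation. The article does not state the ROI formula; the sketch below assumes ROI is defined as net monthly benefit divided by monthly infrastructure cost, which is one reading that reproduces the reported value of approximately 1,900%. The upfront effort of 136 person-hours is not included in that ratio, and the break-even figure is not recomputed here because its costing basis is not given.

```python
# Monthly benefit components in USD, as stated in the text.
monthly_benefits = {
    "productivity": 56 * 150,   # 56 hours saved at USD 150 per hour
    "reduced_rework": 3_600,
    "risk_mitigation": 12_000,
}
monthly_infrastructure_cost = 1_200  # USD, GPU resources and storage

total_monthly_benefit = sum(monthly_benefits.values())

# ASSUMPTION: ROI = (benefit - cost) / cost, using monthly infrastructure cost only.
roi_pct = (total_monthly_benefit - monthly_infrastructure_cost) / monthly_infrastructure_cost * 100
```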

5. Discussion and implications

If lead times decrease significantly post-AI, this would support the hypothesis that on-prem AI can accelerate release cycles in DevSecOps. The expected mechanism is that RAG- or RLHF-powered code suggestions and automated test generation reduce manual coding and review time. Faster coding and early detection of issues would shorten the commit-to-deploy interval. This aligns with the literature, noting that generative AI “facilitated automation of coding tasks” in DevSecOps contexts. Moreover, AI-based static analysis or vulnerability scanning can be run continuously, reducing security review delays. A higher deployment frequency could emerge because less work and fewer bottlenecks allow for more changes per sprint. This is consistent with the notion that automation and small batch workflows improve throughput. If the change failure rate also declines, it suggests that AI did not sacrifice quality. In contrast, unchanged or worsened failure rates would signal the need to refine AI tools or retain manual oversight.

This study has several limitations that must be acknowledged. The short duration (two sprints) limits statistical confidence; more iterations would strengthen the conclusions. Task complexity may not be perfectly uniform across sprints, potentially biasing the lead time. Developers’ learning curves with new AI tools could initially reduce productivity (a factor tracked qualitatively). RLHF tuning may require more feedback cycles than those that fit in one sprint. In addition, our internal tools (Flutter front-end, .NET/Python back-end) may respond differently to AI aids than open-source projects. Finally, the research measures only lead time; future work could measure related outcomes (e.g., code quality, team satisfaction).

• Practical Implications: Measurable DevSecOps maturity advancement.

• Theoretical Contribution: Human-in-the-loop AI integration model.

• Limitations: Single-organization, short-duration research.

• Future Work: Longitudinal studies, predictive risk assessment.

5.1. DevSecOps maturity implications

The findings of this study indicate a clear advancement in DevSecOps maturity, characterized by a transition from largely ad hoc security practices to a more structured, automated, and policy-driven release-governance model. Security activities that were previously reactive and manually enforced have become embedded within the delivery pipeline, supported by continuous feedback and AI-assisted decision support. This shift reduced the reliance on individual expertise and informal processes, replacing them with repeatable and auditable controls. The observed improvements are consistent with established DevSecOps maturity models that emphasize early security integration, automation, and continuous monitoring across the software lifecycle. By enabling faster feedback loops and standardized policy enforcement, the AI-augmented approach supports higher levels of operational predictability and governance. Overall, the results suggest that AI-assisted DevSecOps can act as a maturity accelerator, helping organizations progress toward more resilient, scalable, and sustainable secure software delivery practices.

5.2. Role of AI-assisted release evaluation

AI-assisted release evaluation played a central role in enhancing situational awareness by consolidating the pipeline status, security findings, and compliance information into a single, coherent view of release readiness. This unified perspective reduces cognitive overhead and enables stakeholders to assess risks and progress more efficiently. Crucially, final release decisions remained under human control, ensuring that organizational governance, accountability, and ethical responsibility were preserved. The AI functions as a decision-support mechanism rather than an autonomous authority, reinforcing trust in the release process while improving the speed and quality of evaluation.

5.3. Managerial implications

For internal enterprise systems, the findings demonstrate that DevSecOps investments can yield measurable delivery benefits within short-sprint cycles. Automation reduces coordination overhead and supports more predictable release results.

Figure 2 highlights the “Robustness” achieved using the AI-augmented approach. The Baseline process, which is heavily reliant on manual verification, yielded a 28% failure rate, which was largely attributed to human oversight in the analysis of complex log files. The AI-augmented system reduced this to 12%. This significant reduction validates the effectiveness of RAG in retrieving critical error patterns from security logs and tickets, while RLHF ensures that the model’s approval criteria are aligned with the specific security standards of the organization, preventing “hallucinated” approvals.

Figure 2. Change Failure Rate Comparison Between Baseline and AI-Augmented DevSecOps Sprints.

Bar chart illustrating the percentage of deployments requiring remediation (hotfix or rollback) across the two sprint conditions. The Baseline sprint recorded a higher change failure rate (28%), primarily associated with manual log interpretation and delayed detection of security issues. Following AI augmentation, the failure rate decreased to 12%, representing improved release robustness. The reduction reflects the contribution of AI-assisted security scanning, contextual log retrieval through RAG, and policy-aligned validation refined via RLHF, supporting enhanced reliability without compromising deployment velocity.

5.4. DevSecOps maturity implications

The results reflect a measurable transition from ad hoc security integration to automated and policy-driven release governance, consistent with established DevSecOps maturity models. AI-assisted summaries function as a decision-support mechanism, enhancing situational awareness without displacing human authority. Analysis of the pipeline stages revealed that the most significant improvements occurred during the security validation and release approval phases. Build and test durations remained largely unchanged, indicating that efficiency gains were attributable to governance automation rather than increased development speed.

5.5. Limitations

The research was limited by its short execution period and evaluation within one organization. Moreover, the AI functionality was intentionally constrained to assistive summarization tasks, excluding predictive and autonomous decision-making. Consequently, the measured impact may underestimate the potential benefits of broader and more proactive AI integration approaches.

5.6. Interpretation of key findings

Lead Time Reduction Mechanism.

The observed 39.2% reduction in the lead time stems from three interconnected mechanisms:

1. Parallel Processing Enablement: AI-assisted security scanning transforms a sequential bottleneck into a parallel process. Traditional security reviews require serial expert attention, whereas the AI system provides a preliminary analysis, enabling concurrent human validation.

2. Cognitive Load Reduction: Consolidated AI summaries reduce the information-processing burden on release managers. As expressed by one participant: “The AI doesn’t make decisions for us, but it tells us exactly what we need to look at”.

3. Early Feedback Integration: Real-time AI recommendations during development prevented security and quality issues from progressing through the pipeline, addressing the fundamental DevOps principle of “shifting left.”

Quality Maintenance Paradox.

Contrary to the anticipated speed-quality tradeoff, we observed simultaneous improvement in both delivery speed and change quality. This paradoxical outcome can be explained by the amplification effect, wherein AI tools amplify human expertise rather than replacing it. Security experts can focus on complex vulnerability patterns, while AI handles routine checks, thereby increasing overall inspection coverage.

Learning Feedback Loop: RLHF mechanisms create a virtuous cycle in which human decisions train the AI system, which in turn improves its recommendations for subsequent decisions.

5.7. Theoretical contributions

Extending DevSecOps Maturity Models. The findings of this study extend the established DevSecOps maturity frameworks by introducing AI-Augmented Maturity Levels:

Level 4 (AI-assisted): Traditional Level 4 (Quantitatively Managed) augmented with:

• Predictive quality gates based on historical patterns.

• Intelligent risk-based approval routing.

• Automated compliance documentation.

Level 5 (AI-optimized): Traditional Level 5 (Optimizing) enhanced with:

• Continuous model refinement via RLHF.

• Adaptive pipeline configuration.

• Prescriptive remediation guidance.

5.8. Human-AI collaboration model for DevSecOps

This study proposes a Complementary Intelligence Framework in which:

• AI excels in pattern recognition, consistency, scalability, and data synthesis.

• Humans excel at contextual judgment, ethical considerations, complex reasoning, and exception handling.

• Collaboration Interface: Structured handoffs with explicit accountability boundaries.
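One way to make the accountability boundary in such a handoff explicit is to model the AI output and the human decision as separate records, so that a release cannot proceed on an AI recommendation alone. The following is an illustrative sketch only; the class and function names are hypothetical and do not describe the system built in this study.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AiRecommendation:
    change_id: str
    suggested_action: str  # "approve" or "reject"
    rationale: str


@dataclass(frozen=True)
class ReleaseDecision:
    change_id: str
    approved: bool
    decided_by: str            # always a named human approver
    ai_suggested_action: str   # retained for auditability
    followed_ai: bool


def decide_release(recommendation: AiRecommendation, approver: str,
                   human_approves: Optional[bool]) -> ReleaseDecision:
    """Record a release decision; the AI advises, a human must decide."""
    if human_approves is None:
        raise ValueError("A human decision is required before release")
    ai_approves = recommendation.suggested_action == "approve"
    return ReleaseDecision(
        change_id=recommendation.change_id,
        approved=human_approves,
        decided_by=approver,
        ai_suggested_action=recommendation.suggested_action,
        followed_ai=(human_approves == ai_approves),
    )


# Hypothetical usage.
rec = AiRecommendation("CHG-001", "approve", "No critical findings in scan summary")
decision = decide_release(rec, approver="release-manager", human_approves=True)
```

Recording `followed_ai` alongside the approver supports the transparent logging of AI-supported decisions and allows acceptance rates to be computed from the audit trail.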

5.9. Practical implications for engineering managers

The findings of this study suggest a clear implementation roadmap for engineering managers seeking to adopt AI in DevSecOps practices. Initial deployments should emphasize assistive rather than autonomous AI capabilities, focusing on summarization and recommendation functions that support human judgment rather than replacing it. High-friction stages of the delivery pipeline, particularly security validation and compliance documentation, should be prioritized to maximize the early impact. In parallel, managers must invest in structured change management to address both technical integration challenges and organizational readiness. Establishing governance frameworks early, including clearly defined AI usage policies and accountability structures, is essential to ensure trust and compliance. Several success factors were consistently identified across implementations. These include strong executive sponsorship, incremental rollout through controlled experimentation, transparent logging of AI-supported decisions to support auditability, and continuous feedback loops to guide ongoing system refinement and alignment with organizational needs.

5.10. For security and compliance teams

Transformational Opportunities:

• From Gatekeepers to Enablers: Shift from blocking releases to enabling secure acceleration.

• Scalable Assurance: Leverage AI to extend security coverage without a proportional headcount increase.

• Risk-Based Prioritization: Use AI risk scoring to focus expert attention on the highest-impact issues.

The limitations and boundary conditions of this study include methodological concerns such as internal validity threats, where learning effects from team familiarity with the pipeline may have influenced performance, the Hawthorne effect may have altered behavior due to awareness of observation, and inherent task variability persisted despite standardization efforts. Construct validity considerations also arise, as using lead time reduction as a proxy may not fully capture delivery value or customer impact, and the short two-sprint duration restricts the evaluation of long-term sustainability.

The generalizability of the findings is limited by several contextual factors, including organizational maturity, as the results may not extend to teams lacking established DevOps practices; regulatory environments, where heavily regulated industries may require distinct AI governance approaches; technical constraints, particularly in organizations without on-premises AI infrastructure; and cultural readiness, as teams resistant to AI adoption may experience different outcomes. In addition, the technical limitations of AI systems must be considered, such as the restricted scope of knowledge bases that constrain retrieval-augmented generation effectiveness, the risk of perpetuating organizational biases through training data, and the explainability gap created by the black-box nature of certain AI recommendations, which can undermine trust in critical systems.

The analysis also revealed two emergent phenomena: the Expertise Amplification Effect, where junior developers benefited disproportionately from AI assistance, reducing lead time by 52% compared to 31% for senior developers, suggesting that AI may act as an expertise equalizer within teams; and the Documentation Paradox, in which automation lowered manual documentation effort yet overall documentation volume rose by 28%, though this increase produced documentation that was more structured in machine-readable formats, more traceable through links to specific code changes, and more actionable by being integrated into remediation workflows.

Future research should prioritize longitudinal studies that assess the sustainability of AI augmentation across multiple quarters, replication across diverse industries and team structures to validate the findings, and comprehensive economic analyses that include indirect benefits. On the technical side, key questions involve determining optimal human-AI task allocation strategies, developing explainable AI systems for DevSecOps and critical infrastructure, and advancing federated learning approaches to enable privacy-preserving model training across organizational boundaries. At the organizational level, research should explore the cultural and structural dynamics that influence successful AI adoption, examine how developer roles evolve with augmentation, and establish governance models and regulatory frameworks for AI-assisted software delivery.

6. Conclusion, limitations, and future research

This study presents a sprint-based experimental evaluation of Agile DevSecOps release management, demonstrating a statistically significant lead-time reduction through automated security integration and AI-assisted decision support. This research contributes to both academic and industrial DevSecOps practices by providing controlled experimental evidence that on-premises generative AI can significantly reduce software delivery lead time by 39.2% (p < 0.01) without degrading quality. It introduces a practical plan–automate–monitor framework, enhanced with RAG and RLHF components, to guide the adoption of AI-assisted release management while ensuring human oversight and governance. In addition, this study proposes a comprehensive and replicable measurement methodology that enables a systematic quantitative evaluation of the impact of AI on software delivery, addressing a key gap in existing DevSecOps research. This study extends DevSecOps research by providing empirical, stage-level evidence of how assistive generative AI alters release dynamics in regulated enterprise settings. These findings challenge the traditional assumption of a speed–security trade-off by demonstrating that human-in-the-loop AI can simultaneously enhance delivery efficiency and governance effectiveness. For practitioners, this study offers a replicable measurement framework and implementation blueprint for integrating AI into DevSecOps pipelines without relinquishing human control. The results indicate that organizations can achieve measurable delivery improvements by targeting high-friction governance points, particularly security validation and release-approval processes. In an era of accelerating digital transformation and escalating cyber threats, the integration of generative AI into DevSecOps practices offers a promising path toward more resilient and responsive software delivery systems. 
By maintaining a principled focus on human oversight, auditability, and continuous improvement, organizations can harness the potential of AI not only to accelerate their release cycles but also to elevate the quality, security, and reliability of the software systems upon which modern society increasingly depends. This research is limited by its short duration and single organizational context. Future research should evaluate longitudinal effects, multi-team deployments, and predictive risk assessment capabilities to further validate the scalability and sustainability of the proposed model. On-premises generative AI can significantly reduce DevSecOps lead time while maintaining governance standards, offering a viable path for enterprises seeking to accelerate secure software delivery. The future of software delivery lies not in choosing between human expertise and artificial intelligence but in forging new partnerships that leverage the unique strengths of each, creating development ecosystems that are simultaneously more efficient, secure, and human-centered. As software systems become increasingly critical to economic and social infrastructure, the responsible integration of AI into development practices represents both a competitive imperative and an ethical responsibility. This research provides an initial roadmap for organizations embarking on this journey, emphasizing that the ultimate measure of success is not merely faster software delivery but more trustworthy, secure, and valuable software systems. The convergence of AI and DevSecOps represents not only a technological evolution but also a fundamental reimagining of software delivery paradigms. Our findings suggest that when thoughtfully integrated with appropriate human oversight, AI systems can transform the traditional speed–security trade-off into a synergistic relationship in which enhanced security enables accelerated delivery.
In conclusion, this study demonstrates that thoughtfully integrated on-premises generative AI can serve as an effective decision-support mechanism in DevSecOps, enabling faster and more reliable software delivery while preserving essential security and compliance controls.

Ethical approval and consent

Data availability statement

Repository name: Assessing the Impact of AI-Augmented DevSecOps on Lead Time in Agile Release Management. https://doi.org/10.5281/zenodo.18830679 [Agung Gunawan et al., 2026].

Underlying data

• change_level_data.csv (Change-level dataset containing baseline and AI-augmented commit timestamps, deployment timestamps, and calculated lead times per change).

• release_data.csv (Release-level dataset including ReleaseID, sprint identifier, number of changes, and failure status for each deployment).

• rlhf_learning_curve.csv (Weekly AI recommendation acceptance rates and reinforcement learning from human feedback performance metrics corresponding to AI component evaluation).

• statistical_tests.txt (Output of Welch’s two-sample t-test and associated statistical analysis results for lead-time comparison).
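The lead-time comparison described for these files can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' script: the function names and the sample values are hypothetical, timestamps are assumed to be ISO 8601 strings, and Cohen's d is assumed to use the pooled standard deviation. Only the standard library is used, so the sketch reports the Welch t statistic and the Welch-Satterthwaite degrees of freedom but not a p-value.

```python
import math
import statistics
from datetime import datetime


def lead_time_hours(commit_ts, deploy_ts):
    """Lead time for one change: deployment time minus commit time, in hours."""
    delta = datetime.fromisoformat(deploy_ts) - datetime.fromisoformat(commit_ts)
    return delta.total_seconds() / 3600.0


def welch_t(a, b):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb  # squared standard error of the mean difference
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(se2)
    # Welch-Satterthwaite approximation of the degrees of freedom
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


def cohens_d(a, b):
    """Cohen's d with the pooled standard deviation (an assumed convention)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var)


# Invented lead times in hours, for illustration only (not the study data).
baseline = [10.0, 12.0, 14.0, 16.0]
augmented = [6.0, 7.0, 8.0, 9.0, 10.0]

t_stat, dof = welch_t(baseline, augmented)
effect = cohens_d(baseline, augmented)
print(f"t = {t_stat:.3f}, df = {dof:.2f}, d = {effect:.3f}")
```

Applied to the real data, `baseline` and `augmented` would hold the per-change lead times of each sprint; a p-value would additionally require a t-distribution routine such as the one provided by SciPy.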

Extended data

Assessing the Impact of AI-Augmented DevSecOps on Lead Time in Agile Release Management. Zenodo. 2026. https://doi.org/10.5281/zenodo.18830679 [Agung Gunawan et al., 2026].

• Supplementals File 01 - Diagrams.docx: contents of the high-level diagram, the Git diagram, and the full UML processing diagram.

• supplementals File 02 - CODES.docx: the Python code used for the diagrams and the data processing.

All data and extended materials are available under the terms of the Creative Commons Attribution 4.0 International (CC BY 4.0) License.

Acknowledgement

The authors thank the Interdisciplinary School of Management and Technology, Institut Teknologi Sepuluh Nopember, Surabaya, for facilitating the entire study process. We are grateful for their invaluable support.

References

Abiona, Oladapo, Modupe: The emergence and importance of DevSecOps: Integrating and reviewing security practices within the DevOps pipeline. World Journal of Advanced Engineering Technology and Sciences. 2024;11(2):127–133. 10.30574/wjaets.2024.11.2.0093

Adewusi, Daraojimba, Adaga: AI in precision agriculture: A review of technologies for sustainable farming practices. World Journal of Advanced Research and Reviews. 2024;1:2276–2285. 10.30574/wjarr.2024.21.1.0314

Agarwal, Gupta, Choudhury: Continuous and Integrated Software Development using DevOps. 2018, June 1. 10.1109/icacce.2018.8458052

Agarwal: Test Case Automation: Transforming Software Testing in the Digital Era. International Journal of Computing and Engineering. 2024;6(5):52–58. 10.47941/ijce.2314

Ahmed, Francis: Integrating Security with DevSecOps: Techniques and Challenges. 2019;178–182. 10.1109/icd47981.2019.9105789

Ajiga, Okeleke, Folorunsho: Enhancing software development practices with AI insights in high-tech companies. Computer Science & IT Research Journal. 2024;5(8):1897–1919. 10.51594/csitrj.v5i8.1450

Akbar, Smolander, Mahmood: Toward successful DevSecOps in software development organizations: A decision-making framework. Inf. Softw. Technol. 2022;147:106894. 10.1016/j.infsof.2022.106894

Amugongo, Mascheroni, Brooks: Retrieval augmented generation for large language models in healthcare: A systematic review. PLOS Digital Health. 2025;4(6):e0000877. PMID: 40498738. 10.1371/journal.pdig.0000877. PMC12157099

Arslan, Munawar, Cruz: Business insights using RAG–LLMs: a review and case study. J. Decis. Syst. 2024;1–30. ahead-of-print. 10.1080/12460125.2024.2410040

Azonuche, Enyejo: Agile Transformation in Public Sector IT Projects Using Lean-Agile Change Management and Enterprise Architecture Alignment. International Journal of Scientific Research and Modern Technology. 2024;3(8):21–39. 10.38124/ijsrmt.v3i8.432

Badshah, Khan: Towards Process Improvement in DevOps. 2020;427–433. 10.1145/3383219.3383280

Bahi, Gharib, Gahi: Integrating Generative AI for Advancing Agile Software Development and Mitigating Project Management Challenges. Int. J. Adv. Comput. Sci. Appl. 2024;15(3). 10.14569/IJACSA.2024.0150306

Bedoya, Palacios, Díaz-López: Enhancing DevSecOps practice with Large Language Models and Security Chaos Engineering. Int. J. Inf. Secur. 2024;23(6):3765–3788. 10.1007/s10207-024-00909-w

Cervone: Understanding agile project management methods using Scrum. OCLC Systems & Services: International digital library perspectives. 2011;27(1):18–22. 10.1108/10650751111106528

Donca I-C, Stan, Misaros: Method for Continuous Integration and Deployment Using a Pipeline Generator for Agile Software Projects. Sensors. 2022;22(12):4637. PMID: 35746421. 10.3390/s22124637. PMC9231338

Dugbartey, Kehinde: Optimizing project delivery through agile methodologies: Balancing speed, collaboration and stakeholder engagement. World Journal of Advanced Research and Reviews. 2025;25(1):1237–1257. 10.30574/wjarr.2025.25.1.0193

Pasuksmit, Tantithamthavorn: AI for DevSecOps: A Landscape and Future Opportunities. ACM Trans. Softw. Eng. Methodol. 2025;34(4):1–61. 10.1145/3712190

Gajbhiye, Aggarwal, Jain: Automated Security Testing in DevOps Environments Using AI and ML. International Journal for Research Publication and Seminar. 2024;15(2):259–271. 10.36676/jrps.v15.i2.1472

Gao, Xiong, Gao: Retrieval-Augmented Generation for Large Language Models: A Survey. 2023. 10.48550/arxiv.2312.10997

Garg, Pundir, Rathee: On Continuous Integration/Continuous Delivery for Automated Deployment of Machine Learning Models using MLOps. 2021;25–28. 10.1109/aike52691.2021.00010

Gargari, Habibi: Enhancing medical AI with retrieval-augmented generation: A mini narrative review. Digital Health. 2025;11. PMID: 40343063. 10.1177/20552076251337177. PMC12059965

Agung Gunawan, Singgih, Ginardi RVH: Assessing the Impact of AI-Augmented DevSecOps on Lead Time in Agile Release Management. 2026. 10.5281/zenodo.18830679

Hatch, Curry: Changing how we evaluate research is difficult, but not impossible. eLife. 2020;9. 10.7554/elife.58654

Hikov, Murphy: Information retrieval from textual data: Harnessing large language models, retrieval augmented generation, and prompt engineering. Journal of AI, Robotics & Workplace Automation. 2024;3(2):142. 10.69554/qafe6376

Jeong: A Study on the Implementation of Generative AI Services Using an Enterprise Data-Based LLM Application Architecture. Advances in Artificial Intelligence and Machine Learning. 2023;03(04):1588–1618. 10.54364/aaiml.2023.1191

Joel, Rajakumari, Hemalatha: To Survive in a Fast-Changing Business Landscape in the Age of Digital Transformation, Organizations Must Be Flexible and Adaptive. IGI Global;2024;289–304. 10.4018/979-8-3693-3318-1.ch016

Junker, Bakker, Derks: Agile work practices: measurement and mechanisms. Eur. J. Work Organ. Psy. 2022;32(1):1–22. 10.1080/1359432x.2022.2096439

Karamitsos, Thabit, Apostolopoulos: Applying DevOps Practices of Continuous Automation for Machine Learning. Information. 2020;11(7):363. 10.3390/info11070363

Khan, Parahyanti, Hussain: The Role of Generative AI in Human Resource Management: Enhancing Operational Efficiency, Decision-Making, and Addressing Ethical Challenges. Asian Journal of Logistics Management. 2024;3(2):104–125. 10.14710/ajlm.2024.24671

Knollmeyer, Caymazer, Grossmann: Document GraphRAG: Knowledge Graph Enhanced Retrieval Augmented Generation for Document Question Answering Within the Manufacturing Domain. Electronics. 2025;14(11):2102. 10.3390/electronics14112102

Liang: Enhancing Security in DevOps by Integrating Artificial Intelligence and Machine Learning. J Theory Pract Eng Sci. 2024;4(02):31–37. 10.53469/jtpes.2024.04(02).05

Nadella, Addula, Yadulla: Generative AI-Enhanced Cybersecurity Framework for Enterprise Data Privacy Management. Computers. 2025;14(2):55. 10.3390/computers14020055

Neha, Bhati, Shukla: Retrieval-Augmented Generation (RAG) in Healthcare: A Comprehensive Review. AI. 2025;6(9):226. 10.3390/ai6090226

Omran Almagrabi, Khan: Optimizing Secure AI Lifecycle Model Management with Innovative Generative AI Strategies. IEEE Access. 2025;13:12889–12920. 10.1109/access.2024.3491373

Paasivaara, Durasiewicz, Lassenius: Using Scrum in Distributed Agile Development: A Multiple Case Study. 2009;195–204. 10.1109/icgse.2009.27

Prates, Pereira: DevSecOps practices and tools. Int. J. Inf. Secur. 2024;24(1). 10.1007/s10207-024-00914-z

Rangnau, Buijtenen, Fransen: Continuous Security Testing: A Case Study on Integrating Dynamic Security Testing Tools in CI/CD Pipelines. 2020;145–154. 10.1109/edoc49727.2020.00026

Salo, Abrahamsson: An iterative improvement process for agile software development. Softw Process Improv Pract. 2006;12(1):81–100. 10.1002/spip.305

Schmid: Five years post-DORA: Promoting best practices for research assessment. Mol. Biol. Cell. 2017;28(22):2941–2944. PMID: 29084913. 10.1091/mbc.e17-08-0534. PMC5662254

Shamsuddoha, Khan, Chowdhury MMH: Revolutionizing Supply Chains: Unleashing the Power of AI-Driven Intelligent Automation and Real-Time Information Flow. Information. 2025;16(1):26. 10.3390/info16010026

Singh, Ehtesham, Kumar: Agentic Retrieval-Augmented Generation: A Survey on Agentic RAG. 2025. 10.48550/arxiv.2501.09136

Snyder, Curtis: Using Analytics to Guide Improvement during an Agile–DevOps Transformation. IEEE Softw. 2018;35(1):78–83. 10.1109/ms.2017.4541032

Tomas, Huang: An Empirical Study on Culture, Automation, Measurement, and Sharing of DevSecOps. 10.1109/cybersecpods.2019.8884935

Uludağ, Putta, Paasivaara: Evolution of the Agile Scaling Frameworks. Springer;2021;123–139. 10.1007/978-3-030-78098-2_8

Ur Rahman, Williams: Software security in DevOps. 2016;70–76. 10.1145/2896941.2896946

Wessel, Adam, Benlian: Generative AI and Its Transformative Value for Digital Platforms. J. Manag. Inf. Syst. 2025;42(2):346–369. 10.1080/07421222.2025.2487315

Yigit, Buchanan, Tehrani: Review of Generative AI Methods in Cybersecurity. 2024. 10.48550/arxiv.2403.08701

Gan, Zhang: Evaluation of Retrieval-Augmented Generation: A Survey. 2024. 10.48550/arxiv.2405.07437

Zayat, Senvar: Framework Study for Agile Software Development Via Scrum and Kanban. Int. J. Innov. Technol. Manag. 2020;17(04). 10.1142/s0219877020300025

Zhang: Hallucination Mitigation for Retrieval-Augmented Large Language Models: A Review. Mathematics. 2025;13(5):856. 10.3390/math13050856

Zhao, Zhang: Retrieval-Augmented Generation for AI-Generated Content: A Survey. 2024. 10.48550/arxiv.2402.19473

Zhou: Advanced Embedding Techniques in Multimodal Retrieval Augmented Generation: A Comprehensive Study on Cross-Modal AI Applications. J Comput Electron Inf Manag. 2024;13(3):16–22. 10.54097/h8wf8vah

Zota, Bărbulescu, Constantinescu: A Practical Approach to Defining a Framework for Developing an Agentic AIOps System. Electronics. 2025;14(9):1775. 10.3390/electronics14091775

10.5256/f1000research.196407.r487649

Reviewer response for version 1

Aji Prasetya Wibawa (Referee) 1

1 Universitas Negeri Malang, Malang, East Java, Indonesia

Competing interests: No competing interests were disclosed.

8 6 2026

2026

This is an open access peer review report distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Recommendation: Approved with Reservations

The article investigates the impact of AI-augmented DevSecOps on lead time in Agile release management. The authors compare a conventional DevSecOps sprint with an AI-augmented sprint integrating retrieval-augmented generation, reinforcement learning from human feedback, and human-in-the-loop release support within an enterprise CI/CD environment. The study reports a substantial reduction in lead time for changes, with secondary improvements in security scanning time, approval latency, deployment frequency, and change failure rate. The topic is timely and relevant, especially given the growing interest in applying generative AI to software engineering, DevSecOps, security validation, and enterprise release management.

Overall, the manuscript addresses an important and practically relevant problem. The attempt to provide empirical evidence from an enterprise DevSecOps setting is valuable, and the focus on lead time for changes is appropriate. However, I recommend Approved with Reservations because several issues need to be addressed before the article can be considered scientifically sound and fully reproducible.

1. Clarity, presentation, and current literature

The work is partly clearly and accurately presented. The article cites a broad range of literature on DevSecOps, generative AI, RAG, RLHF, agile development, CI/CD, and security automation. However, the manuscript contains substantial repetition across the Introduction, Methods, Results, Discussion, and Conclusion sections. Several paragraphs restate the same claims about AI-assisted DevSecOps, lead-time reduction, and human-in-the-loop governance.

More importantly, parts of the Results section still appear to be written as if the study were a proposal rather than a completed empirical study. For example, wording such as “This study expects…” and references to tables being populated in a “real experiment” should be removed or rewritten. Since the manuscript already reports actual sprint-level results, all hypothetical or proposal-style language should be replaced with precise reporting of the completed study.

The literature coverage is generally adequate, but the authors should strengthen the positioning of the study against prior empirical work on DevOps metrics, DORA metrics, AI-assisted software engineering, DevSecOps automation, and quasi-experimental evaluations in software engineering. Some cited sources appear broad or only indirectly related, and the manuscript would benefit from a more selective use of high-quality and directly relevant references.

Points that must be addressed:

Remove all proposal-style or hypothetical wording from the Results and Discussion sections.

Reduce repetition across the manuscript.

Clarify the exact research gap and how this study differs from existing empirical DevOps and AI-assisted software engineering studies.

Ensure that all cited literature is directly relevant and current.

2. Study design and technical soundness

The study design is partly appropriate. A quasi-experimental within-team comparison is a reasonable starting point for evaluating AI augmentation in a real enterprise DevSecOps setting. Using the same team, comparable sprint duration, and operational CI/CD logs is appropriate for an applied software engineering study.

However, the design has important limitations. The study compares only two consecutive sprints: one baseline sprint and one intervention sprint. This design is vulnerable to confounding factors, including task complexity differences, learning effects, team adaptation, sprint planning differences, temporal effects, and Hawthorne effects. Although the authors mention control variables such as comparable story points, identical team composition, and similar infrastructure, the manuscript does not provide enough evidence that the two sprint backlogs were truly comparable.

The intervention also combines multiple AI-related components, including RAG, RLHF, code assistance, security recommendations, release summaries, and approval support. Because these components were introduced together, it is difficult to determine which component contributed most to the observed lead-time reduction. The manuscript discusses stage-level improvements, but the causal attribution to specific AI components remains limited.

Points that must be addressed:

Provide a clearer description of the two sprint backlogs, including task types, story points, complexity distribution, and domain similarity.

Explain how comparability between the baseline and intervention sprints was assessed.

Discuss confounding factors more explicitly.

Avoid overly causal language unless supported by the design.

Clarify which parts of the AI intervention were active in each pipeline stage.

3. Methods and replication details

The manuscript provides useful general information about the research setting, team composition, technology stack, CI/CD tools, and broad AI architecture. However, the methodological details are only partly sufficient for replication.

The authors should provide more details on the AI system configuration, RAG pipeline, retrieval corpus, embedding model, chunking strategy, retrieval parameters, prompt design, RLHF procedure, model refinement process, and human feedback protocol. The manuscript mentions FAISS, all-MiniLM-L6-v2, Llama 2 7B, JSONL audit logs, and weekly refinement cycles, but the exact operational implementation is not sufficiently detailed.

There is also a possible inconsistency in the hardware description. The manuscript refers to an on-premises Llama 2 7B infrastructure and also states that the workflow was executed without external GPU acceleration on an Intel Core i5 machine with 16GB RAM. This requires clarification because the computational setup affects reproducibility and feasibility.

Points that must be addressed:

Provide a technical configuration table for the AI system.

Describe the RAG knowledge base, document sources, preprocessing, chunk size, embedding model, vector index, retrieval top-k, prompt structure, and response validation process.

Describe the RLHF feedback process in more detail, including who provided feedback, how feedback was encoded, and whether the model was actually fine-tuned or only adjusted through feedback-based selection.

Clarify the hardware environment and whether GPU resources were used.

Provide enough pipeline configuration details so that another team could reproduce the intervention.
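One way to make the requested configuration table concrete is a small typed record in which every unreported parameter is visibly empty. The sketch below is only an illustration: the class and field names are hypothetical, the three filled-in values are the component names mentioned above (FAISS, all-MiniLM-L6-v2, Llama 2 7B), and every other field is left as `None` because its value is not stated here.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class RagConfig:
    """Hypothetical configuration record for the AI-augmented pipeline."""
    # Component names mentioned in the report above
    vector_index: Optional[str] = "FAISS"
    embedding_model: Optional[str] = "all-MiniLM-L6-v2"
    generator_model: Optional[str] = "Llama 2 7B"
    # Not stated in the report; None marks a value that still has to be reported
    chunk_size_tokens: Optional[int] = None
    chunk_overlap_tokens: Optional[int] = None
    retrieval_top_k: Optional[int] = None
    prompt_template_id: Optional[str] = None
    gpu_used: Optional[bool] = None


def unreported(cfg):
    """List the configuration fields that have no reported value."""
    return [f.name for f in fields(cfg) if getattr(cfg, f.name) is None]


missing = unreported(RagConfig())
print("Still to be reported:", ", ".join(missing))
```

A record of this kind could be exported alongside the pipeline definition so that a replicating team can see at a glance which parameters were fixed and which were never specified.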

4. Statistical analysis and interpretation

The statistical analysis is partly appropriate. Welch’s t-test, effect size estimation, confidence intervals, and descriptive statistics are suitable for comparing lead time between two groups when variances may differ. Reporting Cohen’s d and a confidence interval is also helpful.

However, the interpretation should be more cautious because the sample size is limited to 42 changes, with 21 per sprint. The independence assumption should be discussed because changes within the same sprint and team may not be fully independent. The use of multiple metrics also raises the issue of selective interpretation. The manuscript should clarify whether statistical testing was applied only to lead time or also to secondary metrics such as security scan time, approval wait time, deployment frequency, and change failure rate.

There is also a major issue of numerical inconsistency. In one part of the manuscript, the change failure rate is reported as decreasing from 14.3% to 8.7%, while another figure and its caption report a decrease from 28% to 12%. These cannot both be correct unless they refer to different denominators or different subsets. This must be corrected because change failure rate is central to the claim that faster delivery did not compromise quality.

Points that must be addressed:

Correct the inconsistency in change failure rate values.

Clarify the denominator used for change failure rate.

Explain whether the 42 changes are independent observations.

Report whether statistical tests were conducted for secondary outcomes.

Interpret causality more cautiously due to the two-sprint quasi-experimental design.
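The denominator issue can be illustrated with a short calculation. The sketch below uses invented release records, not the study data, and assumes (hypothetically) that every change bundled into a failed release counts as a failed change. It shows that the same deployments give different change failure rates depending on whether releases or individual changes form the denominator.

```python
def change_failure_rate(failed, total):
    """Percentage of failed units over all units, for a chosen denominator."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * failed / total


# Invented records for illustration: (release_id, number_of_changes, failed)
releases = [("R1", 5, True), ("R2", 3, False), ("R3", 2, False), ("R4", 4, False)]

# Denominator 1: releases (one failed release out of four)
per_release = change_failure_rate(
    sum(1 for _, _, failed in releases if failed), len(releases)
)

# Denominator 2: individual changes, counting all changes in a failed release
total_changes = sum(n for _, n, _ in releases)
failed_changes = sum(n for _, n, failed in releases if failed)
per_change = change_failure_rate(failed_changes, total_changes)

print(f"per release: {per_release:.1f}%, per change: {per_change:.1f}%")
```

Stating which of these definitions underlies each reported figure would show whether the two sets of values are contradictory or merely computed on different bases.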

5. Source data and reproducibility

The article states that source data and extended materials are available in a Zenodo repository. This is appropriate and aligns well with the F1000Research model. However, based on the manuscript alone, reproducibility appears only partial.

The repository is described as containing change-level data, release-level data, RLHF learning curve data, statistical test outputs, diagrams, and code. This is useful. However, full reproducibility requires the data dictionary, exact variable definitions, scripts used for statistical analysis, preprocessing steps, and the pipeline configuration used to generate the tables and figures. The manuscript should also clarify whether the source data are raw logs, anonymized operational data, or processed datasets.

Points that must be addressed:

Include a clear data dictionary for all CSV files.

Provide the scripts used to generate the reported statistics, tables, and figures.

Clarify whether the repository contains raw or processed data.

Explain any anonymization or aggregation procedures.

Ensure that the reported results in the manuscript can be reproduced directly from the shared data and scripts.
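A data dictionary can also be checked mechanically against the shared files. The following sketch is a minimal example under stated assumptions: the column names and descriptions are hypothetical stand-ins, since the actual headers of the CSV files are not listed here, and the sample row is invented.

```python
import csv
import io

# Hypothetical data dictionary for a change-level file: column -> description
DATA_DICTIONARY = {
    "change_id": "Unique identifier of the change",
    "sprint": "Sprint label (baseline or AI-augmented)",
    "commit_ts": "Commit timestamp, ISO 8601",
    "deploy_ts": "Deployment timestamp, ISO 8601",
    "lead_time_hours": "deploy_ts minus commit_ts, in hours",
}


def check_header(csv_text, dictionary):
    """Compare a CSV header with the data dictionary.

    Returns (undocumented, absent): columns present in the file but not
    documented, and documented columns that the file does not contain.
    """
    header = next(csv.reader(io.StringIO(csv_text)))
    undocumented = [col for col in header if col not in dictionary]
    absent = [col for col in dictionary if col not in header]
    return undocumented, absent


# Invented sample content, for illustration only
sample = (
    "change_id,sprint,commit_ts,deploy_ts,reviewer\n"
    "1,baseline,2026-01-05T09:00:00,2026-01-05T21:30:00,x\n"
)
undocumented, absent = check_header(sample, DATA_DICTIONARY)
print("Undocumented columns:", undocumented)
print("Documented but absent:", absent)
```

Running such a check for each CSV file before deposit would catch mismatches between the documented variables and the data actually shared.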

6. Support for conclusions

The conclusions are partly supported by the results. The reported data suggest that the AI-augmented sprint had lower lead time and improved several operational metrics. However, the strength of the conclusions should be moderated.

The manuscript sometimes implies that on-premises generative AI significantly reduces DevSecOps lead time while preserving governance and quality. This is plausible within the studied context, but the evidence comes from a short, single-organization, two-sprint evaluation. Therefore, the conclusion should be limited to this specific organizational setting and should avoid broad claims about regulated enterprise environments in general.

The claim that the findings challenge the traditional speed-security trade-off is interesting but somewhat overstated. The evidence supports a preliminary indication of simultaneous improvement in speed and selected reliability indicators within one case study, but not a general challenge to the broader speed-security trade-off across DevSecOps contexts.

Points that must be addressed:

Narrow the conclusion to the studied organization and sprint context.

Avoid broad generalization to all regulated enterprise environments.

Reframe the study as preliminary empirical evidence rather than definitive proof.

Align the conclusion with the limitations of the two-sprint quasi-experimental design.

Minor comments:

The abstract is informative, but it should be checked after correcting all numerical inconsistencies.

The Introduction contains repeated paragraphs and should be shortened.

The Methods section should distinguish more clearly between planned methodology and executed methodology.

Table numbering should be checked because Table 1 appears to be used for different purposes in different parts of the manuscript.

The Results section should not include sentences suggesting that the data are hypothetical.

Figure 2 and Table 3 should be reconciled because they report different change failure rate values.

The Conclusion is too long and repetitive. It should be shortened and focused on the key findings, limitations, and future work.

Some terminology should be standardized, including “DevOps,” “DevSecOps,” “lead time for changes,” “release readiness,” and “production deployment.”

The authors should proofread the manuscript carefully for grammar, spacing, and formatting inconsistencies.

The references should be checked for relevance and formatting consistency.

I recommend Approved with Reservations. The article addresses a relevant and timely topic and provides potentially useful empirical evidence on AI-augmented DevSecOps. However, the authors must address the issues above, especially the inconsistent numerical results, the remaining proposal-style language, the limited description of the AI intervention, the short two-sprint design, and the need for clearer reproducibility details. Once these issues are corrected, the article would make a useful contribution to the literature on AI-assisted DevSecOps and Agile release management.

Is the work clearly and accurately presented and does it cite the current literature?

Partly

If applicable, is the statistical analysis and its interpretation appropriate?

Partly

Are all the source data underlying the results available to ensure full reproducibility?

Partly

Is the study design appropriate and is the work technically sound?

Partly

Are the conclusions drawn adequately supported by the results?

Partly

Are sufficient details of methods and analysis provided to allow replication by others?

Partly

Reviewer Expertise:

The article has academic value and addresses an important topic, but it still needs specific revisions related to numerical consistency, methodological clarity, reproducibility, and interpretation of findings.

I confirm that I have read this submission and believe that I have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however I have significant reservations, as outlined above.