Episodes
-
A Java service keeps getting OOMKilled in Kubernetes even though memory requests look fine on paper. This episode explains why JVM heap defaults ignore container limits, how to set maximum heap size correctly, and what interviewers expect when they probe your understanding of Java memory in containerized environments. Covers the -Xmx flag, -XX:+UseContainerSupport, native memory overhead, and the tradeoffs between requests and limits.
Full interview prep guides and scenario walkthroughs: DevOpsInterview.Cloud
-
When AWS fires the 2-minute Spot reclaim notice, Karpenter's interruption queue is the difference between a blip and a batch job disaster. Here's exactly how to configure it.
You'll learn:
  - How to set karpenter.sh/capacity-type in a NodePool to prefer Spot with automatic On-Demand fallback
  - The full interruption flow: SQS queue → cordon → graceful drain → pod rescheduling, all within the 2-minute window
  - Why the order of values in the capacity-type array doesn't control selection: Karpenter uses price-capacity optimization
  - When to use strict values: ['spot'] and what happens when capacity dries up
  - Why Pod Disruption Budgets and terminationGracePeriodSeconds are non-negotiable for fault-tolerant batch workloads
Keywords: Karpenter Spot interruption handling, Spot instance fallback on-demand, NodePool capacity type configuration, Kubernetes batch workload cost optimization, Spot 2-minute warning drain
🎧 Listen, then go deeper: DevOps & Cloud interview-prep ebooks at DevOpsInterview.Cloud
-
Automated canary analysis for a Flink-based streaming app is a common senior SRE interview scenario. Here's how to wire Prometheus, Loki, and Pyroscope into a production-grade rollout strategy.
You'll learn:
  - How to define canary success criteria using Prometheus metrics like consumer lag, throughput, and error rate on Flink jobs
  - Using Loki log queries to surface structured errors in canary vs. baseline deployments side-by-side
  - Continuous profiling with Pyroscope to catch CPU or memory regressions in the new Flink version before full rollout
  - How automated analysis gates work (failing fast vs. baking time) and how to articulate the tradeoff in an interview
  - Stitching observability signals into a single canary decision: pass, fail, or inconclusive
Keywords: canary deployment Flink, automated canary analysis SRE, Prometheus Loki Pyroscope, streaming app observability, DevOps interview questions
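The pass, fail, or inconclusive decision described above can be sketched in a few lines. This is a minimal illustration, not the episode's implementation: the `Signal` type, the canary-to-baseline ratio, and the threshold values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Signal:
    """One canary-vs-baseline comparison (hypothetical shape)."""
    name: str
    ratio: Optional[float]  # canary / baseline; None means the query returned no data
    max_ratio: float        # assumed threshold: a ratio above this is a breach


def decide(signals: List[Signal]) -> str:
    """Fail on any breach, report inconclusive on missing data, otherwise pass."""
    if any(s.ratio is not None and s.ratio > s.max_ratio for s in signals):
        return "fail"
    if any(s.ratio is None for s in signals):
        return "inconclusive"
    return "pass"


signals = [
    Signal("consumer_lag", ratio=1.05, max_ratio=1.20),
    Signal("error_rate", ratio=0.98, max_ratio=1.10),
    Signal("cpu_profile", ratio=None, max_ratio=1.15),  # profiler returned nothing
]
print(decide(signals))  # inconclusive
```

Checking for breaches before missing data means one bad signal fails the rollout even when another signal is absent, which matches the fail-fast side of the tradeoff.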
-
Grafana Mimir storage at 10TB/day scale forces real trade-offs. Here's how to configure tiered storage to S3 without bleeding cost or tanking query performance.
You'll learn:
  - How Mimir's store-gateway and compactor interact with S3-backed object storage at high ingest volume
  - Configuring blocks_storage with tiered retention: keeping hot blocks in fast storage while offloading cold blocks to S3 Glacier-compatible tiers
  - Tuning compaction schedules and chunk caching (memcached) to reduce S3 GET costs under sustained 10TB/day ingest
  - Common pitfalls: misconfigured bucket lifecycle policies, compactor overlap errors, and index cache misses killing query latency
  - Sizing ruler and alertmanager storage separately so they don't contend with block storage I/O
Keywords: Grafana Mimir S3 storage, Mimir tiered storage config, Mimir compactor tuning, metrics storage at scale, Mimir blocks_storage
-
If your service has a 99.99% SLO and Azure drops a zone for 15 minutes, here's exactly how to calculate the error budget burn rate before your next SRE interview.
You'll learn:
  - How to derive total monthly error budget from a 99.99% SLO (~4.38 minutes/month)
  - Why a 15-minute outage consumes roughly 3.4x your entire monthly budget, and how to show that math
  - The burn rate formula interviewers expect: burn rate = error rate / (1 − SLO target)
  - How fast vs. slow burn rates map to alerting windows in Google's SRE workbook approach
  - Common gotchas: partial zone failures, dependency blame, and how to frame mitigation in your answer
Keywords: SLO error budget burn rate, Azure availability zone outage, SRE interview questions, error budget calculation, 99.99 SLO math
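The arithmetic behind those figures is short enough to check by hand or in code. The sketch below assumes an average month of 365.25 × 24 × 60 / 12 ≈ 43,830 minutes, which is what yields the ~4.38 minute budget. The function names are illustrative.

```python
MINUTES_PER_MONTH = 365.25 * 24 * 60 / 12  # average month, about 43,830 minutes


def error_budget_minutes(slo: float, period_minutes: float = MINUTES_PER_MONTH) -> float:
    """Allowed downtime for the period: (1 - SLO) x period length."""
    return (1 - slo) * period_minutes


def budget_consumed(outage_minutes: float, slo: float) -> float:
    """How many monthly budgets a full outage of this length uses up."""
    return outage_minutes / error_budget_minutes(slo)


def burn_rate(error_rate: float, slo: float) -> float:
    """burn rate = error rate / (1 - SLO target)."""
    return error_rate / (1 - slo)


print(round(error_budget_minutes(0.9999), 2))  # 4.38
print(round(budget_consumed(15, 0.9999), 2))   # 3.42
print(round(burn_rate(1.0, 0.9999)))           # 10000
```

The last line treats the zone outage as a total failure (error rate of 1.0). A partial zone failure would use the observed error rate instead, giving a proportionally lower burn rate.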
-
Designing a PCI-DSS compliant serverless payments architecture on GCP means getting Confidential VMs, Cloud External Key Manager, and Binary Authorization working together. Here's how to answer that in a senior interview.
You'll learn:
  - How Confidential VMs provide hardware-level memory encryption to satisfy PCI-DSS data-in-use requirements
  - Why Cloud External Key Manager (Cloud EKM) lets you hold encryption keys outside GCP's control, and what that means for scope reduction
  - How Binary Authorization enforces cryptographic attestation so only verified container images reach your payment workloads
  - The serverless boundary decisions (Cloud Run vs bare GKE) that affect your Cardholder Data Environment scope
  - Common interview gotchas around shared responsibility, audit logging with Cloud Audit Logs, and VPC Service Controls for perimeter defence
Keywords: PCI-DSS GCP architecture, Confidential VMs interview, Cloud External Key Manager, Binary Authorization Cloud Run, serverless payments compliance
-
Deploying EKS clusters across AWS accounts with CDK is a common senior interview scenario. Here's how to handle VPC peering, Transit Gateway attachments, and IAM trust policies correctly.
You'll learn:
  - How to structure a multi-account CDK app using Stacks across environments with explicit env account/region targets
  - When to use VPC peering vs Transit Gateway for cross-account EKS network connectivity, and the trade-offs at scale
  - How to wire up Transit Gateway attachments and route table propagation so worker nodes can reach shared services
  - Cross-account IAM role assumptions and EKS RBAC config required for cluster access from a management account
  - Common CDK gotchas: bootstrap trust policies, asset S3 bucket permissions, and cross-account CFN execution roles
Keywords: cross-account EKS CDK, AWS Transit Gateway EKS, VPC peering Kubernetes, multi-account EKS architecture, AWS CDK EKS interview
-
Correlating OpenTelemetry traces with CloudWatch Logs Insights across Lambda and Step Functions is a common senior interview scenario. Here's exactly how to answer it.
You'll learn:
  - How to propagate trace context (W3C TraceContext headers) across Lambda invocations and Step Functions state transitions so trace IDs land in your structured logs
  - Configuring the AWS Distro for OpenTelemetry (ADOT) Lambda layer to auto-instrument functions without cold-start penalties
  - Writing CloudWatch Logs Insights queries that join on trace_id to reconstruct an end-to-end execution timeline across services
  - Where correlation breaks: async Step Functions callbacks, missing X-Amzn-Trace-Id propagation, and log sampling mismatches
  - Trade-offs between ADOT, X-Ray native SDK, and a third-party collector like the OpenTelemetry Collector on Fargate
Keywords: OpenTelemetry Lambda tracing, CloudWatch Logs Insights trace correlation, ADOT Step Functions, serverless observability interview questions
-
Splitting a monolithic 4GB Terraform state file into scoped microstates is one of the nastiest live-infrastructure challenges you'll face. Here's how to do it without downtime using terraform state rm and moved blocks.
You'll learn:
  - Why state files balloon past 4GB and why that breaks plan/apply performance
  - How to use terraform state rm to surgically extract resources without destroying them
  - Using moved blocks to re-home resources into child state backends cleanly
  - Sequencing the migration to avoid drift, lock contention, and accidental deletes
  - How to validate microstate integrity with terraform state list and targeted plans before cutting over
Keywords: terraform state splitting, terraform state rm, moved blocks terraform, monorepo to microstate migration, terraform refactor interview
-
Designing a monorepo CI pipeline that doesn't collapse under 1,000 microservices means getting Bazel remote caching and selective test execution right from the start.
You'll learn:
  - How to structure a monorepo CI pipeline so only affected services trigger builds, using Bazel's dependency graph to compute the minimal affected set
  - Configuring Bazel remote caching (local cache, shared remote cache via gRPC or HTTP) to avoid rebuilding unchanged targets across parallel CI workers
  - Selective testing strategies: combining bazel query with Build Event Protocol output (for example --build_event_json_file) to identify and run only impacted test targets
  - Common failure modes at scale: cache poisoning, overly broad BUILD file dependencies, and flaky remote executor connections
  - How to structure the CI orchestration layer (GitHub Actions, Buildkite, or Tekton) to fan out Bazel shards without thrashing the remote cache
Keywords: monorepo CI pipeline, Bazel remote caching, selective testing microservices, CI at scale DevOps interview, platform engineering build systems
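Computing the minimal affected set is a reverse walk of the dependency graph: start from the changed targets and follow "depended on by" edges. Bazel's `rdeps()` query function does this natively. The sketch below only illustrates the idea on a hand-written graph, and every target label in it is hypothetical.

```python
from collections import defaultdict, deque
from typing import Dict, Set


def affected_targets(deps: Dict[str, Set[str]], changed: Set[str]) -> Set[str]:
    """Return the changed targets plus everything that transitively depends on them."""
    # Invert the edges: for each dependency, record who depends on it.
    rdeps = defaultdict(set)
    for target, direct in deps.items():
        for dep in direct:
            rdeps[dep].add(target)

    affected = set(changed)
    queue = deque(changed)
    while queue:
        node = queue.popleft()
        for parent in rdeps[node]:
            if parent not in affected:
                affected.add(parent)
                queue.append(parent)
    return affected


# Hypothetical graph: target -> its direct dependencies.
deps = {
    "//svc/orders:bin": {"//lib/db"},
    "//svc/search:bin": {"//lib/index"},
    "//lib/db": {"//lib/base"},
    "//lib/index": set(),
    "//lib/base": set(),
}
print(sorted(affected_targets(deps, {"//lib/base"})))
# ['//lib/base', '//lib/db', '//svc/orders:bin']
```

Here a change to `//lib/base` leaves `//svc/search:bin` untouched, which is the selective-execution saving the pipeline is after.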
-
Learn how to generate dynamic Azure RBAC role assignments using Pulumi with YAML-driven definitions, including tag-scoped conditions like restricting storage access to env:prod resources only.
You'll learn:
  - How to define custom Azure RBAC roles in YAML and hydrate them through Pulumi's automation layer
  - Using condition and conditionVersion fields in role assignments to enforce attribute-based access control (ABAC)
  - Scoping storage permissions to resources matching specific tag key/value pairs at assignment time
  - Structuring Pulumi component resources so YAML definitions stay DRY across multiple environments
  - Common gotchas: condition syntax errors, propagation delays, and principal vs. scope mismatches
Keywords: Azure RBAC Pulumi, dynamic role assignments Azure, Pulumi YAML infrastructure, Azure ABAC tag conditions, custom RBAC roles interview
-
Taming Prometheus cardinality explosion in an Istio service mesh (dropping from 10 million to 500K active series using relabel_configs and recording rules) is exactly the kind of production war story senior SRE interviews dig into.
You'll learn:
  - Why Istio telemetry generates cardinality explosions and which high-cardinality labels (source_workload, destination_service, pod IPs) are the usual culprits
  - How to use metric_relabel_configs to drop or rewrite labels before series are ingested into TSDB storage
  - Writing recording rules to pre-aggregate high-resolution Istio metrics into lower-cardinality rollups
  - Using topk and cardinality analysis queries to identify which metrics are burning your series budget
  - Trade-offs between dropping labels at scrape time versus aggregating at query time, and why interviewers care about the difference
Keywords: Prometheus cardinality, Istio metrics, relabel_configs, recording rules, TSDB series limit
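Why one label can dominate the series count is easy to see with a toy model: every distinct combination of label values is its own series, so removing a per-pod label collapses many series into one. The data below is invented for illustration, and `pod_ip` is a hypothetical label name standing in for the pod IP labels mentioned above.

```python
from typing import Dict, FrozenSet, Iterable, List


def series_count(series: List[Dict[str, str]], drop: Iterable[str] = ()) -> int:
    """Count distinct label sets after removing the labels named in `drop`."""
    dropped = set(drop)
    distinct: set = set()
    for labels in series:
        key: FrozenSet = frozenset(
            (name, value) for name, value in labels.items() if name not in dropped
        )
        distinct.add(key)
    return len(distinct)


# Invented data: 3 workloads x 50 pods, each pod carrying its own IP label.
series = [
    {
        "__name__": "istio_requests_total",
        "source_workload": f"svc-{w}",
        "pod_ip": f"10.0.{w}.{p}",
    }
    for w in range(3)
    for p in range(50)
]

print(series_count(series))                   # 150
print(series_count(series, drop=["pod_ip"]))  # 3
```

This is a counting model only. In a real setup the remaining labels still have to identify each series uniquely, which is one reason pre-aggregating with recording rules is covered alongside dropping labels.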
-
A developer pushes a Terraform module with a public S3 bucket. Here's exactly how to catch and block it in your Argo CD pipeline using Conftest policy-as-code before it ever reaches production.
You'll learn:
  - How Conftest integrates with Argo CD as a pre-sync hook to enforce OPA policies on Terraform plans
  - Writing a Rego rule that flags acl = public-read or block_public_acls = false on aws_s3_bucket resources
  - Where in the GitOps workflow the gate fires, and why admission controllers alone aren't enough for IaC drift
  - How to surface policy failures as Argo CD sync errors so engineers see the violation before merge, not after deploy
  - Common gotchas: Terraform plan JSON output format, conftest namespace mismatches, and false positives on legacy modules
Keywords: Conftest Argo CD policy, OPA Terraform GitOps, block public S3 bucket IaC, GitOps security controls, Rego policy Terraform plan
-
Managing a Terragrunt dependency graph across 500+ modules without hitting circular dependencies or version drift is one of the hardest scaling problems in platform engineering.
You'll learn:
  - How to map and audit a large Terragrunt dependency graph using terragrunt graph-dependencies and DAG visualisation tools
  - Patterns for structuring module hierarchies to prevent circular dependencies before they reach CI
  - Enforcing module versioning with OCI registries, and why OCI beats Git tags at this scale
  - How to segment a 500+ module monorepo into dependency tiers so targeted runs stay fast
  - Common failure modes: implicit dependencies, missing mock_outputs, and run-all ordering bugs
Keywords: Terragrunt dependency graph, Terragrunt at scale, OCI module registry, circular dependencies Terraform, platform engineering IaC
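Catching a circular dependency before it reaches CI comes down to cycle detection on the module graph. The sketch below is a generic depth-first search over a hand-written adjacency map, not Terragrunt's own check, and the module names are hypothetical.

```python
from typing import Dict, List, Optional


def find_cycle(deps: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a path (first node repeated at the end), or None."""
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = IN_PROGRESS
        path.append(node)
        for dep in deps.get(node, []):
            dep_state = state.get(dep, UNSEEN)
            if dep_state == IN_PROGRESS:
                # dep is already on the current path, so we have looped back to it.
                return path[path.index(dep):] + [dep]
            if dep_state == UNSEEN:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for node in deps:
        if state.get(node, UNSEEN) == UNSEEN:
            found = visit(node)
            if found:
                return found
    return None


healthy = {"app": ["db", "vpc"], "db": ["vpc"], "vpc": []}
broken = {"app": ["db"], "db": ["vpc"], "vpc": ["app"]}
print(find_cycle(healthy))  # None
print(find_cycle(broken))   # ['app', 'db', 'vpc', 'app']
```

The recursive form keeps the example short. For very deep dependency chains an iterative traversal would avoid Python's recursion limit.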
-
External Secrets Operator lets you sync HashiCorp Vault dynamic secrets directly into Kubernetes Secrets: no Vault Agent sidecars, no annotation sprawl.
You'll learn:
  - How ESO's ExternalSecret and SecretStore CRDs map Vault paths to Kubernetes Secrets
  - Why dynamic secrets (short-lived, auto-rotated) are preferable to static tokens and how ESO handles lease renewal
  - The auth methods ESO supports for talking to Vault (Kubernetes auth vs. AppRole) and when to use each
  - Common failure modes: stale secrets after Vault seal, RBAC misconfigs, and refresh interval gotchas
  - How to scope a ClusterSecretStore safely across namespaces without over-permissioning
Keywords: External Secrets Operator, HashiCorp Vault Kubernetes integration, dynamic secrets management, Vault sidecar alternative, Kubernetes secrets sync
-
Parallel Jenkins jobs deploying Helm charts can deadlock silently. Here's how to catch and fix mutex contention before it kills your pipeline.
You'll learn:
  - Why concurrent Helm deploys compete for the same release lock and how that surfaces as a deadlock in Jenkins
  - How to run jstack against the Jenkins JVM to capture thread dumps and identify which threads are waiting on a monitor lock
  - Reading mutex lock output to pinpoint the blocked executor and the thread holding it
  - Helm-side mitigations: namespace isolation, --atomic flag behaviour, and serialising releases with lockfiles or pipeline lock() steps
  - When to escalate from a workaround to a structural fix: separate agents, dedicated namespaces, or a Helm operator pattern
Keywords: Jenkins parallel jobs deadlock, Helm chart deployment lock, jstack thread dump Jenkins, mutex lock CI/CD pipeline, Jenkins pipeline concurrency
-
Learn how to enforce CloudFormation stack drift detection at scale using AWS Config rules and Lambda-driven auto-remediation, a common architecture question in senior Cloud and DevOps interviews.
You'll learn:
  - How AWS Config detects configuration drift against CloudFormation expected stack states using managed and custom rules
  - Wiring an EventBridge rule to trigger a Lambda function when Config flags a stack as DRIFTED
  - Lambda remediation patterns: re-running cloudformation detect-stack-drift vs. forcing a stack update to reconcile out-of-band changes
  - Gotchas around drift detection cost, IAM permissions for the Config recorder, and distinguishing intentional changes from real drift
  - How to scope remediation safely: alerting vs. hard auto-rollback and when each is appropriate in production
Keywords: CloudFormation drift detection, AWS Config auto-remediation, Lambda CloudFormation remediation, IaC drift enforcement, AWS Config rules interview
-
Reducing DynamoDB Global Tables data transfer costs by 70% is achievable in a multi-region Active-Active setup, if you know where the money is actually going.
You'll learn:
  - Why replicated write costs dominate in DynamoDB Global Tables and how to model them accurately
  - Using write sharding and conditional writes to reduce unnecessary replication traffic
  - DAX (DynamoDB Accelerator) placement per region to cut cross-region read fallback
  - Architecting read patterns to stay local, avoiding the latency and cost of cross-region reads
  - Cost monitoring with AWS Cost Explorer tags scoped to replication vs. application traffic
Keywords: DynamoDB Global Tables cost optimization, multi-region Active-Active AWS, DynamoDB replication costs, AWS data transfer cost reduction
-
When a database migration fails mid-deploy, your Kubernetes job hooks and Flyway versioning strategy are the difference between a five-minute fix and a 2am incident.
You'll learn:
  - How to structure Flyway versioned and undo migrations so a failed V3 doesn't leave your schema in a half-applied state
  - Using Kubernetes init containers and Job postStart/preStop hooks to gate application rollout on migration success or failure
  - Why flyway repair matters when checksums break and how to use it safely in CI pipelines
  - Patterns for keeping application code and schema changes in sync across canary and blue-green deployments
  - What interviewers actually want to hear when they ask about zero-downtime schema migrations at scale
Keywords: Flyway rollback strategy, Kubernetes job hooks database, schema versioning DevOps interview, failed database migration recovery
-
When terraform apply times out creating 100+ IAM roles, the culprit is usually AWS API throttling combined with Terraform's default parallelism. Here's how to fix it.
You'll learn:
  - Why the default parallelism=10 isn't always safe and when raising it to -parallelism=50 helps vs. hurts
  - How AWS IAM's eventual-consistency model causes race conditions during bulk role creation
  - Batching strategies: splitting large role sets across multiple terraform apply runs or using for_each with targeted applies
  - Reading AWS API throttle errors in Terraform debug output (TF_LOG=DEBUG) to confirm the real bottleneck
  - Exponential backoff and retry tuning via the AWS provider's max_retries setting
Keywords: terraform apply timeout, AWS IAM role throttling, terraform parallelism, terraform at scale, IAM API rate limits
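Exponential backoff itself is a small pattern: wait longer after each throttled attempt, with random jitter so parallel workers don't retry in lockstep. The sketch below shows the general idea only. It is not the AWS provider's implementation, and `ThrottledError`, the delay values, and the retry count are assumptions made for the example.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class ThrottledError(Exception):
    """Hypothetical stand-in for an API throttling response."""


def call_with_backoff(
    fn: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 20.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on ThrottledError with capped exponential backoff and full jitter."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except ThrottledError:
            if attempt == max_retries:
                raise  # out of retries, surface the throttle to the caller
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))  # full jitter: anywhere up to the ceiling
    raise AssertionError("unreachable")


# Usage: a fake API call that is throttled twice, then succeeds.
attempts = {"count": 0}


def create_role() -> str:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ThrottledError("Rate exceeded")
    return "created"


print(call_with_backoff(create_role, sleep=lambda seconds: None))  # created
```

Passing `sleep` as a parameter keeps the example fast to run and easy to test. Real callers would leave the default.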