Detail	Info
Exam code	AZ-305
Full name	Designing Microsoft Azure Infrastructure Solutions
Questions	40–60 questions (case studies, scenario-based, drag-and-drop)
Passing score	700 / 1000
Duration	120 minutes
Cost	$165 USD
Prerequisite	AZ-104 Azure Administrator (required)
Renewal	Annual free online assessment

AZ-305 exam domain weights

Domain 1: Design Identity, Governance & Monitoring 25–30%

Domain 2: Design Data Storage Solutions 20–25%

Domain 3: Design Business Continuity Solutions 15–20%

Domain 4: Design Infrastructure Solutions 30–35%

Course modules

Module 01 (3 lessons)

Azure Governance & Identity Design

Management Group hierarchy for enterprise-scale; Azure Policy initiative scopes and effects (Deny, Audit, Modify, Append); RBAC least-privilege patterns with Contributor vs Owner vs custom roles; resource locks (CanNotDelete vs ReadOnly); PIM eligible assignments and approval workflows; Azure Managed Applications for self-service catalogs; Budget action groups with Automation Runbooks.

Lesson 1.1 Management Group hierarchy and subscription topology

AZ-305 is a design exam, so questions on governance test your topology skills more than your config. Management Group hierarchy is the canonical "draw the org chart" question that opens most exams. Pick the right shape on day one or unwind it later under pressure.

Key concepts

Tenant Root Group: the implicit ceiling of every tenant. Policies and role assignments here apply to every subscription. Don't attach RBAC here unless you really mean "everyone, everywhere".
Recommended pattern: Tenant Root → Platform / Landing Zones / Sandbox / Decommissioned (functional grouping under root), with subscriptions placed by lifecycle stage and workload type. Cloud Adoption Framework (CAF) ships this layout as a starting point.
Subscription as billing AND blast-radius boundary: spending limits, quota limits, and policy scope all stop at the subscription. Greenfield landing zones almost always carve one subscription per business unit per environment (prod/non-prod) to bound blast radius.
Move costs: moving subscriptions between management groups is cheap, but moving them between tenants is brutal (migration, not move). Get the tenant right at the start — multi-tenant designs are an AZ-305 anti-pattern unless legally required.
Empty MGs as policy anchors: hold a future-state subscription scope by creating an empty MG that already has policy attached. New subscriptions are placed there and inherit the policy instantly, no retro work.

Concrete example

An enterprise with 40 business units wants central platform services (network, security, identity) separate from workload subscriptions, and a sandbox for experimentation that is policy-isolated. Design: under Tenant Root, four MGs: Platform (connectivity / management / identity subs), Landing Zones (split into Corp and Online sub-MGs), Sandbox, and Decommissioned. Per-BU MGs sit under Landing Zones. Production-grade policies attach at Landing Zones; relaxed policies at Sandbox. The Decommissioned MG carries a deny-all policy so that subscriptions on their way out can't be used for new deployments.
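
The inheritance behaviour behind this design can be sketched in a few lines of Python. This is a toy model, not an Azure SDK call: the `Scope` class, the policy names, and the subscription names are all hypothetical, and real assignments carry parameters and exclusions that are ignored here.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Scope:
    """A management group or subscription in a toy hierarchy."""
    name: str
    parent: Optional["Scope"] = None
    policies: list = field(default_factory=list)


def effective_policies(scope: Scope) -> list:
    """Collect policy names from the root down to the given scope."""
    chain = []
    node = scope
    while node is not None:
        chain.append(node)
        node = node.parent
    result = []
    for node in reversed(chain):  # root first, most specific last
        result.extend(node.policies)
    return result


# Hypothetical policy names attached at each management group.
root = Scope("Tenant Root", policies=["audit-activity-logs"])
landing_zones = Scope("Landing Zones", root, ["deny-public-ip"])
corp = Scope("Corp", landing_zones, ["require-private-endpoints"])
sandbox = Scope("Sandbox", root, ["budget-cap"])
decommissioned = Scope("Decommissioned", root, ["deny-all-deployments"])

# Subscriptions inherit everything above them the moment they are placed.
bu_prod_sub = Scope("bu01-prod", corp)
retired_sub = Scope("bu07-old", decommissioned)

print(effective_policies(bu_prod_sub))
print(effective_policies(retired_sub))
```

Moving `bu_prod_sub` under a different parent changes its effective policy set without touching the subscription itself, which is why the topology matters more than any single assignment.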

Key takeaway: Management Group topology is policy and RBAC inheritance, not org chart. Adopt the CAF skeleton, then specialise. Get the tenant right at the start — moves between MGs are cheap, moves between tenants are migrations.

⚡ Mini-quiz

Drill MG-topology scenarios → study mode (10 questions).

Lesson 1.2 Azure Policy at scale — effects, initiatives, exclusions

AZ-305 expects you to pick the right policy effect for the right control. Each effect has different ergonomics and different remediation behaviour. Pick wrong and you either fail to enforce (Audit when you needed Deny) or break legitimate workflows (Deny when you needed Audit).

Key concepts

Effects: Deny (block create/update at API), Audit (log non-compliance, allow it), Append (add missing fields at create), Modify (add/remove/replace tags or properties, supports remediation), DeployIfNotExists (deploy missing resource — e.g., Log Analytics agent), AuditIfNotExists (log if a related resource is missing), Disabled.
Initiatives (policy sets): bundle multiple policies and assign together. Built-in initiatives like the Microsoft cloud security benchmark (formerly Azure Security Benchmark) ship hundreds of controls; you tweak parameters per assignment. Tracking compliance at initiative level is easier than per-policy.
Assignment scope and exclusions: assign at MG/subscription/RG, exclude specific child scopes via notScopes. Exclusions are surgical — exclude a single resource that needs the exception without disabling the policy globally.
Parameters: almost every built-in policy is parameterised. Same definition can be assigned with different allowed-locations lists per region or different required-tag values per BU.
Remediation tasks: for Modify and DeployIfNotExists effects. Existing non-compliant resources need a remediation task to be brought into line. The task runs under the assignment's managed identity.

Concrete example

A regulated workload must (1) deploy only to eastus2 or westus2, (2) carry the CostCenter tag, (3) auto-deploy the Microsoft Defender for Cloud agent. Solution: an initiative containing the built-in Allowed locations policy (Deny), the Require a tag policy (Deny), and the Configure ASC monitoring agent policy (DeployIfNotExists with remediation managed identity). Assign at the regulated workload's MG. Run a remediation task to bring existing resources into compliance.
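
To make the difference between the effects concrete, here is a minimal sketch of how the three policies in this initiative would treat a create request. It is a simplified stand-in for the real policy engine: the `evaluate` function, the `hasMonitoringAgent` flag, and the resource dictionaries are illustrative assumptions.

```python
ALLOWED_LOCATIONS = {"eastus2", "westus2"}
REQUIRED_TAG = "CostCenter"
VM_TYPE = "Microsoft.Compute/virtualMachines"


def evaluate(resource: dict) -> list:
    """Return the (effect, reason) pairs a request triggers in this toy model."""
    outcomes = []
    if resource.get("location") not in ALLOWED_LOCATIONS:
        outcomes.append(("Deny", "location not allowed"))
    if REQUIRED_TAG not in resource.get("tags", {}):
        outcomes.append(("Deny", "missing tag " + REQUIRED_TAG))
    # hasMonitoringAgent is a hypothetical flag standing in for the
    # existence check that DeployIfNotExists performs.
    if resource.get("type") == VM_TYPE and not resource.get("hasMonitoringAgent", False):
        outcomes.append(("DeployIfNotExists", "monitoring agent missing"))
    return outcomes


def is_blocked(resource: dict) -> bool:
    """Only Deny stops the request; DeployIfNotExists lets it through and fixes it up."""
    return any(effect == "Deny" for effect, _ in evaluate(resource))


good_vm = {"type": VM_TYPE, "location": "eastus2", "tags": {"CostCenter": "1234"}}
bad_vm = {"type": VM_TYPE, "location": "westeurope", "tags": {}}

print(is_blocked(good_vm), evaluate(good_vm))
print(is_blocked(bad_vm), evaluate(bad_vm))
```

The compliant VM is allowed but still triggers the agent deployment; the non-compliant one never gets created, so nothing is left to remediate.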

Key takeaway: Deny for hard rules, Audit for visibility, Modify/DeployIfNotExists when you need to actually enforce something on existing resources. Bundle into initiatives — exam scenarios almost always pair multiple policies.

⚡ Mini-quiz

Practise policy-effect decisions → quick quiz (5 questions).

Lesson 1.3 PIM, custom roles, and identity-design trade-offs

Privileged Identity Management (PIM) turns standing access into just-in-time elevation. Custom roles fill the gaps in built-in role coverage. Together they let you satisfy enterprise audit requirements without making people miserable.

Key concepts

PIM eligibility vs active assignment: Eligible = "can request activation, gets the role for X hours". Active = "has the role right now". Eligible-only design eliminates standing high-privilege access.
Approval workflow: activation requires approval from a configured set of approvers. Combine with MFA, justification, ticket number requirement. Audit log captures everything for compliance.
Access reviews: recurring reviews where designated reviewers attest that listed users still need their roles. Auto-removes users who don't respond or whose reviewer denies. Commonly used as evidence for ISO 27001 / SOC 2 audits.
Custom roles: JSON definition with actions, notActions, dataActions, notDataActions, and a list of assignableScopes. Use when no built-in matches — e.g., "restart VMs but never delete them", "read Key Vault secrets but never list them".
Identity-design trade-offs: hybrid identity offers three sign-in methods: Password Hash Sync (PHS) (simple, default), Pass-Through Auth (PTA) (auth happens on-prem), and Federation (ADFS) (legacy, complex). AZ-305 favours PHS for most scenarios; PTA only when on-prem must validate every auth; federation only when a hard requirement (smart cards, etc.) exists.
Conditional Access: AZ-500 territory but appears on AZ-305 in design questions. Build policies around user/group, location, app, sign-in risk, device state. Grant controls (require MFA, compliant device) and session controls (session lifetime, app enforced restrictions).

Concrete example

A regulated bank wants: (1) no standing access to subscription-level Owner roles, (2) MFA required for elevation, (3) quarterly access reviews on all Owner-eligible users. Design: assign Owner via PIM Eligible, set activation max 4 hours with MFA + approval + justification required. Configure quarterly access reviews on the Eligible assignments, with each user's manager as reviewer and "automatic removal if no response" turned on. Pair with a custom role SubscriptionViewer for the 90% of users who only need read access.
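
The custom-role idea from the key concepts ("restart VMs but never delete them") can be sketched as a role definition plus a small permission check. The role names and the all-zero subscription ID are placeholders, and the matcher is a simplification: it uses Python's case-sensitive glob matching, whereas real RBAC evaluation has more rules than shown here.

```python
from fnmatch import fnmatchcase

# Explicit allow-list: only the operations an operator needs.
vm_restart_operator = {
    "Name": "VM Restart Operator",  # hypothetical role name
    "Actions": [
        "Microsoft.Compute/virtualMachines/read",
        "Microsoft.Compute/virtualMachines/start/action",
        "Microsoft.Compute/virtualMachines/restart/action",
    ],
    "NotActions": [],
    "DataActions": [],
    "NotDataActions": [],
    "AssignableScopes": ["/subscriptions/00000000-0000-0000-0000-000000000000"],
}

# Alternative shape: broad wildcard with delete carved out via NotActions.
vm_no_delete = {
    "Name": "VM Manager Without Delete",  # hypothetical role name
    "Actions": ["Microsoft.Compute/virtualMachines/*"],
    "NotActions": ["Microsoft.Compute/virtualMachines/delete"],
    "DataActions": [],
    "NotDataActions": [],
    "AssignableScopes": ["/subscriptions/00000000-0000-0000-0000-000000000000"],
}


def is_allowed(role: dict, operation: str) -> bool:
    """Granted by Actions and not removed again by NotActions."""
    granted = any(fnmatchcase(operation, p) for p in role["Actions"])
    excluded = any(fnmatchcase(operation, p) for p in role["NotActions"])
    return granted and not excluded


restart = "Microsoft.Compute/virtualMachines/restart/action"
delete = "Microsoft.Compute/virtualMachines/delete"
for role in (vm_restart_operator, vm_no_delete):
    print(role["Name"], is_allowed(role, restart), is_allowed(role, delete))
```

The allow-list version is usually the safer design choice: a wildcard role silently picks up any new operation the resource provider adds later.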

Key takeaway: PIM Eligible + access reviews is the AZ-305 answer for enterprise privilege management. Custom roles fill the built-in gaps. Hybrid identity defaults to PHS unless requirements force PTA or federation.

⚡ Mini-quiz

Drill PIM and custom-role scenarios → study mode (10 questions).

Module 02 (3 lessons)

Hybrid Identity & Monitoring

Pass-Through Authentication (PTA) vs Password Hash Sync vs ADFS federation for compliance requirements; Azure Monitor Agent (AMA) with Data Collection Rules (DCRs); Application Insights for APM and distributed tracing; Log Analytics workspace retention tiers (Analytics vs Basic vs Auxiliary); dual-destination Diagnostic Settings pattern for cost-optimized compliance logging; Management Group subscription hierarchy design.

Lesson 2.1 Hybrid identity sync — PHS, PTA, federation

Hybrid identity is the bridge between on-prem Active Directory and Microsoft Entra ID. The three sign-in methods have different security postures, operational costs, and failure characteristics. The exam asks you to map a compliance requirement to the right method.

Key concepts

Password Hash Sync (PHS): Entra Connect syncs a hash of the on-prem password hash to Entra ID. Users authenticate at Entra; on-prem doesn't need to be reachable. Simplest, default, supports Entra ID smart features (leaked-credential detection, Conditional Access). Most teams should pick this.
Pass-Through Authentication (PTA): Entra Connect installs lightweight agents on-prem that validate passwords against on-prem AD on every sign-in. Password never leaves on-prem. Use when policy requires the password validation to happen on-prem. Requires multiple agents for HA.
Federation (ADFS / third-party): Entra ID redirects auth to on-prem ADFS or a SAML IdP. Most operationally expensive (own ADFS farm) and most fragile. Used only when smart cards, third-party MFA, or specific token claims are non-negotiable.
Seamless SSO: can be added to PHS or PTA. It is Kerberos-based, so users already signed into the on-prem domain see no additional prompt. Federation provides SSO inherently.
Entra Cloud Sync: newer lightweight alternative to Entra Connect for simple sync scenarios. Supports multi-forest from a single sync engine. Doesn't yet match Connect's full feature set; check current docs before committing.
Disaster scenarios: PHS keeps auth working even when on-prem is offline. PTA fails closed. Federation fails closed AND requires complex Multi-Site / WAP farm DR. PHS is the cheapest "still works when AD is down" option.

Concrete example

A 5,000-user firm migrates to Microsoft 365 and Entra ID. Compliance dictates "password validation must happen on-prem". Security wants leaked-credential detection. Both are required. Choice: PTA + PHS together — PTA validates the actual sign-in; PHS sends hashes for leaked-credential detection (which compares to Microsoft's known-bad list, not as primary auth). Add Seamless SSO for domain-joined laptops. Federation explicitly avoided as overkill.
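
The selection logic in this lesson reduces to a short decision function. The sketch below encodes only the rules stated above; the function and parameter names are hypothetical, and a real design review would weigh more factors (HA for PTA agents, existing ADFS investment, and so on).

```python
def choose_sign_in(
    on_prem_validation_required: bool,
    needs_smart_cards_or_custom_claims: bool,
    wants_leaked_credential_detection: bool,
) -> list:
    """Map the lesson's three requirement flags to sign-in methods."""
    if needs_smart_cards_or_custom_claims:
        return ["Federation"]
    if on_prem_validation_required:
        methods = ["PTA"]
        # PHS rides along only to feed leaked-credential detection.
        if wants_leaked_credential_detection:
            methods.append("PHS")
        return methods
    return ["PHS"]


# The 5,000-user firm from the example: on-prem validation plus leak detection.
print(choose_sign_in(True, False, True))
# A default tenant with no special requirements.
print(choose_sign_in(False, False, False))
```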

Key takeaway: PHS for default and disaster resilience. PTA when on-prem MUST validate. Federation only when claims/MFA/smart-card requirements force it. PHS + PTA combined is a legitimate hybrid pattern.

⚡ Mini-quiz

Drill hybrid sync decisions → study mode (10 questions).

Lesson 2.2 Monitoring architecture — Log Analytics, AMA, DCR

AZ-305's monitoring questions are about architecture, not config. Pick the right workspace topology, the right collection mechanism, and the right retention tier — get any wrong and you either lose evidence or pay ten times the necessary cost.

Key concepts

Log Analytics workspace topology: one workspace per region per data-sovereignty boundary is the common pattern. Resist the "one global workspace" temptation — egress to another region costs more than per-workspace overhead.
Azure Monitor Agent (AMA): the modern unified agent (replacing the legacy Log Analytics Agent / OMS Agent). Targets VMs / VMSS / Arc-enabled servers via association with Data Collection Rules.
Data Collection Rules (DCR): declarative resource that specifies WHAT to collect (perf counters, syslog, custom logs) and WHERE to send it. Multiple DCRs can target the same VM via Data Collection Rule Associations (DCRA). Replaces per-workspace agent config of the old model.
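Retention tiers: Analytics (full-feature, queryable in the standard UI, expensive): 30 days of interactive retention by default, extendable at extra cost. Basic (limited query, cheaper, KQL search + filters only): useful for long-tail debug data. Long-term retention, formerly called Archive (needs a restore or search job before querying): for compliance retention. The Auxiliary table plan is a newer low-cost option; check current docs for its query limits.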
Dual-destination pattern: send the same data to BOTH a Log Analytics workspace (for queryable hot data, 30-day retention) AND a Storage Account (long-term archive at a fraction of the cost). A single Diagnostic Setting can list both destinations, which makes this straightforward for platform logs.
Application Insights: APM + distributed tracing for apps. Workspace-based mode is the only supported option now — the AI data lives in your Log Analytics workspace. Used for end-to-end transaction tracing, dependency maps, and live-traffic exception streaming.

Concrete example

A global retailer with strict 7-year audit-log retention and a 90-day operational retention. Design: one Log Analytics workspace per region for operational data (90-day retention on Analytics tier). For audit logs: separate Diagnostic Setting destinations on each workload — Activity Log + key resources stream to BOTH the regional workspace AND a centralised Storage Account with immutability locked for 7 years. AMA + DCRs handle the VM-side collection; Application Insights covers the application tier with workspace-based mode.
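
As a rough illustration of the dual-destination idea, the snippet below builds the body of a diagnostic setting that targets both a workspace and a storage account. The property names (`workspaceId`, `storageAccountId`, `logs`) follow the diagnostic settings resource shape as commonly documented, but treat them as an assumption and verify against the current API reference; the resource IDs and the setting name are placeholders.

```python
import json


def diagnostic_setting(name: str, workspace_id: str, storage_account_id: str, categories: list) -> dict:
    """Build a diagnostic-setting body with two destinations for the same logs."""
    return {
        "name": name,
        "properties": {
            "workspaceId": workspace_id,              # hot, queryable copy
            "storageAccountId": storage_account_id,   # cold, long-term copy
            "logs": [{"category": c, "enabled": True} for c in categories],
        },
    }


# Placeholder resource IDs; substitute real ones in an actual deployment.
setting = diagnostic_setting(
    name="audit-dual-destination",
    workspace_id="/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.OperationalInsights/workspaces/<workspace>",
    storage_account_id="/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Storage/storageAccounts/<account>",
    categories=["Administrative", "Security", "Policy"],
)

print(json.dumps(setting, indent=2))
```

Retention is then governed separately on each side: the workspace keeps 90 days for operations, while the storage account's immutability policy enforces the 7-year hold.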

Key takeaway: per-region workspaces, AMA + DCRs for unified collection, dual-destination for hot+cold retention, Application Insights in workspace-based mode. Resist global-workspace centralisation.

⚡ Mini-quiz

Practise monitoring topology scenarios → quick quiz (5 questions).

Lesson 2.3 Alerts, action groups, and Sentinel integration

Logs only matter if they fire alerts. AZ-305 tests your design judgment on which alert type for which condition, and on whether Sentinel belongs in the architecture.

Key concepts

Alert rule types: Metric (numeric time series), Log query (KQL on Log Analytics), Activity log (control-plane events), Smart Detector (Application Insights anomalies). Pick by what your condition can be expressed as.
Action groups: reusable notification bundles — email, SMS, voice call, push, webhook, Logic App, Azure Function, ITSM. One action group used by many alerts; updating the on-call rotation is one operation, not 50.
Alert processing rules (formerly action rules): override notification behaviour by scope/filter, for example suppressing notifications during a maintenance window or attaching an action group to every alert that matches a filter.
Workbooks and dashboards: Workbooks are interactive parameterised reports built on KQL. Dashboards aggregate metric tiles, log charts, and pinned visuals into a single view. Both deploy as ARM resources alongside the workload.
Microsoft Sentinel: SIEM + SOAR built on top of Log Analytics. Adds analytics rules (KQL detections), incidents, automation playbooks (Logic Apps), and threat intelligence. Use when security ops needs queue-style incident management — not just "alert someone".
Defender for Cloud vs Sentinel: Defender is the CSPM + workload-protection layer (vulnerabilities, misconfig, agent-based runtime protection). Sentinel is the SIEM. Most enterprises run both — Defender feeds findings into Sentinel as alerts.

Concrete example

A regulated workload needs: (1) page on-call when VM CPU > 90% for 10 minutes, (2) page security when a privileged-account sign-in happens from an unusual location, (3) SOC analyst workflow for security incidents. Design: (1) Metric alert on the VM Insights CPU metric → action group ag-oncall (PagerDuty webhook). (2) Log query alert on a SigninLogs KQL query → action group ag-security + auto-create incident in Microsoft Sentinel. (3) Sentinel analytics rule with the same logic + playbook to auto-disable the user via Logic App on critical risk. SOC works the Sentinel incidents queue.
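
The first condition ("CPU > 90% for 10 minutes") is worth pinning down, because it separates the alert's condition from the action group's routing. A minimal sketch, assuming one CPU sample per minute and an average aggregation over the window; the function name and sampling interval are illustrative, not the Azure Monitor implementation.

```python
from statistics import mean


def should_fire(samples: list, threshold: float = 90.0, window: int = 10) -> bool:
    """samples: one CPU % reading per minute, oldest first.

    Fires when the average over the last `window` samples exceeds the threshold.
    """
    if len(samples) < window:
        return False  # not enough data to fill the evaluation window
    return mean(samples[-window:]) > threshold


sustained = [95] * 10
short_spike = [50] * 5 + [95] * 5 + [80] * 5

print(should_fire(sustained))
print(should_fire(short_spike))
```

Where the alert goes once it fires (PagerDuty webhook, email, Logic App) lives entirely in the action group and does not appear in this logic.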

Key takeaway: metric / log query / activity log alerts handle most cases; Sentinel adds incident workflow when security ops needs a queue. Action groups own routing; alerts own conditions; don't conflate them.

⚡ Mini-quiz

Drill alert + Sentinel integration scenarios → study mode (10 questions).

Module 03 (3 lessons)

Data Storage Solutions

Blob lifecycle management policies (Hot → Cool → Archive tier transitions); Azure SQL Database Serverless auto-pause for dev/test cost optimization; Cosmos DB NoSQL API multi-region write (multi-master) and consistency levels; ADLS Gen2 hierarchical namespace for analytics workloads; Azure Files Premium with SMB and AD DS integration; Azure Cache for Redis Standard vs Premium with geo-replication; IoT Hub + Stream Analytics + Synapse pipeline architecture.

Lesson 3.1 Picking the right Azure data store

AZ-305 doesn't ask "which one is fastest"; it asks "which one fits these constraints". The exam loves scenarios where you must match data shape, access pattern, latency, cost, and consistency to the right Azure data service. Memorise the decision tree.

Key concepts

Relational, OLTP: Azure SQL Database for fully managed (DTU or vCore, Serverless option for dev/test). Azure SQL Managed Instance for migrating SQL Server with SQL Agent, cross-DB queries, near-100% SQL Server feature compatibility.
NoSQL document / KV / graph: Cosmos DB across five APIs (NoSQL, formerly Core SQL; MongoDB; Cassandra; Gremlin; Table). Globally distributed, multi-region writes, RU-based capacity, five consistency levels.
Analytical / big data: ADLS Gen2 (hierarchical namespace on Blob — POSIX-like ACLs, ideal for Spark/Hadoop), Azure Synapse Analytics (data warehouse + Spark + Pipelines in one), Databricks (managed Spark).
Object storage: Blob Storage for unstructured. Hot / Cool / Cold / Archive tiers. Use Blob lifecycle policies to age data through tiers automatically.
File shares: Azure Files for SMB/NFS shared file systems, with the Premium tier for low latency and AD DS integration for Kerberos auth. Azure NetApp Files for high-IOPS / enterprise NAS workloads.
Cache: Azure Cache for Redis — Premium tier supports VNet integration, clustering, geo-replication.
Streaming + IoT: IoT Hub (millions of devices), Event Hubs (high-throughput ingestion), Stream Analytics (T-SQL on streams).

Concrete example

A retailer's data platform: (1) the catalog (1M products, looked up by SKU, <10ms target) → Cosmos DB (NoSQL API) with session consistency. (2) The transaction store (orders, foreign keys, ACID) → Azure SQL Database Business Critical. (3) Cold order history (5+ years, compliance retention) → Blob Storage with lifecycle policy Hot → Cool at 30 days → Archive at 1 year. (4) Real-time clickstream → Event Hubs → Stream Analytics → Synapse for analytics. Six purpose-built services across four workloads, none of them stretched outside their sweet spot.
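
One way to memorise the decision tree is to write it down as a lookup keyed on data shape and access pattern. The sketch below covers only the cases named in this lesson; the shape and access labels are hypothetical shorthand, and real scenarios add cost, latency, and consistency constraints on top.

```python
# (data shape, access pattern) -> service, using the lesson's own pairings.
STORE_BY_WORKLOAD = {
    ("relational", "oltp"): "Azure SQL Database",
    ("document", "key-lookup"): "Cosmos DB",
    ("unstructured", "archive"): "Blob Storage with lifecycle policy",
    ("events", "streaming"): "Event Hubs + Stream Analytics",
    ("files", "shared"): "Azure Files",
    ("analytical", "big-data"): "ADLS Gen2 + Synapse",
}


def pick_store(shape: str, access: str) -> str:
    """Return the lesson's suggested service, or ask for a better classification."""
    return STORE_BY_WORKLOAD.get((shape, access), "classify the workload further")


print(pick_store("document", "key-lookup"))   # the product catalog
print(pick_store("relational", "oltp"))       # the order store
print(pick_store("graph", "traversal"))       # not covered by this sketch
```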

Key takeaway: match data shape and access pattern to the purpose-built service. Don't stretch one service across all workloads. AZ-305 questions almost always have one obvious "ideal" answer once you classify the workload correctly.

⚡ Mini-quiz

Drill data-service selection scenarios → study mode (10 questions).

Lesson 3.2 Cosmos DB — partitioning, RUs, and consistency

Cosmos DB shows up in nearly every AZ-305 exam. The defining decisions are partition key, throughput model (provisioned RUs vs serverless vs autoscale), and consistency level. Get these wrong and you either overpay massively or hit throttling under load.

Key concepts

Partition key: chosen at container creation, immutable. Drives physical partition placement. Pick a property with high cardinality and even access distribution (the same rules as DynamoDB). A hot partition caps your effective throughput at 10,000 RU/s, the per-partition ceiling, regardless of total provisioned.
Request Units (RU): abstract throughput unit covering CPU + IOPS + memory. 1 KB point read = 1 RU; 1 KB write = ~5 RU. Provisioned throughput pays for reserved RU/s; serverless pays per million RU consumed; autoscale scales between min and max (paying for max consumed in any hour).
Throughput modes: Provisioned for predictable steady traffic. Serverless for spiky, low-volume traffic below roughly 5,000 RU/s. Autoscale for variable traffic whose peak is at most 10× its baseline (autoscale floors at 10% of the configured maximum). Switch between manual provisioned and autoscale freely; serverless is a separate account type.
Consistency levels: Strong (linearisable, no multi-region writes), Bounded staleness (read lags by <K writes or T seconds), Session (read your own writes, the default for the SDK), Consistent prefix, Eventual. Stronger consistency costs higher latency and lower availability.
Multi-region writes: every region is writeable. Conflict resolution by last-writer-wins (default), a custom merge stored procedure, or the conflicts feed for manual handling. Required for active-active geo deployments.
Indexing policy: by default everything is indexed (expensive on writes). Customise to exclude properties you never query on — significant write-RU savings on write-heavy workloads.

Concrete example

A global IoT platform stores 10M device records, accessed by deviceId 99% of the time and by region 1% of the time. Design: container with partitionKey: /deviceId (10M distinct values, so high cardinality and even distribution). Throughput: autoscale 4,000-40,000 RU/s (10× variance). Consistency: session (SDK default, sufficient for IoT use case). Indexing policy excludes the verbose telemetry payload property, which saves ~30% on write RUs. Multi-region writes enabled in three regions for active-active. Region queries are served from a second, read-only container partitioned by region and kept in sync with the main one (the Cosmos DB counterpart of a DynamoDB-style GSI).
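
The RU sizing behind a range like 4,000-40,000 RU/s can be worked through with the rule-of-thumb costs from the key concepts. This is back-of-envelope arithmetic only: the request rates below are invented for illustration, RU cost is assumed to scale linearly with item size, and real costs depend on indexing and query shape.

```python
import math

READ_RU_PER_KB = 1.0   # from the lesson: 1 KB point read = 1 RU
WRITE_RU_PER_KB = 5.0  # from the lesson: 1 KB write is roughly 5 RU


def required_ru_per_sec(reads_per_sec: float, writes_per_sec: float, item_kb: float = 1.0) -> float:
    """Estimate RU/s, assuming cost scales linearly with item size."""
    return (reads_per_sec * READ_RU_PER_KB + writes_per_sec * WRITE_RU_PER_KB) * item_kb


def autoscale_range(peak_ru_per_sec: float) -> tuple:
    """Round the peak up to the next 1,000 RU/s; autoscale floors at 10% of that."""
    max_ru = math.ceil(peak_ru_per_sec / 1000) * 1000
    return max_ru // 10, max_ru


# Hypothetical peak load: 20,000 point reads/s and 4,000 writes/s of 1 KB items.
peak = required_ru_per_sec(reads_per_sec=20_000, writes_per_sec=4_000)
print(peak)
print(autoscale_range(peak))
```

With these assumed rates the estimate lands on a 40,000 RU/s ceiling and a 4,000 RU/s floor, matching the range in the example. A single hot deviceId would still be limited by the per-partition ceiling no matter how high the container maximum is set.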

Key takeaway: partition key + RU model + consistency level are the three defining Cosmos DB decisions. Most enterprise scenarios end at session consistency with autoscale throughput. Customise the indexing policy on write-heavy workloads.

⚡ Mini-quiz

Practise Cosmos DB partition + RU decisions → quick quiz (5 questions).

Lesson 3.3 Storage redundancy, lifecycle policies, and access control

The "storage detail" questions on AZ-305 are about durability options (LRS/ZRS/GRS/GZRS), automated lifecycle tiering, and access patterns (SAS / RBAC / private endpoints). Get the redundancy choice right based on RPO requirement; let lifecycle policies do the cost optimisation.

Key concepts

Redundancy options: LRS (3 copies in one DC, cheapest), ZRS (3 copies across 3 AZs, survives AZ outage), GRS (LRS + async geo-replication to paired region), GZRS (ZRS + geo). Read-Access variants (RA-GRS, RA-GZRS) make the secondary readable.
Storage account kinds: StorageV2 (GPv2, modern default), BlockBlobStorage (Premium SSD-backed, low-latency for blobs), FileStorage (Premium for Files). Skip GPv1 — it can't do access tiers.
Blob lifecycle policies: JSON rules on a storage account that transition blobs between tiers and delete them based on age or last access. Evaluated roughly daily; the policy itself is free, though the tier changes it triggers are billed as normal operations. Standard pattern: Hot → Cool at 30d, Cool → Archive at 90d, delete at 7y.
Immutable storage (WORM): time-based retention policy or legal hold. Once locked, even the owner can't delete until the policy expires. Required for SEC 17a-4 / FINRA compliance.
Access patterns: SAS (account / service / user-delegation — prefer user-delegation, key-less and revocable), RBAC on data plane (Storage Blob Data Reader / Contributor — for service principals and managed identities), private endpoints (subnet-attached private IP — modern default for VNet-attached workloads), storage firewall (IP + VNet rules with default-deny).
Encryption: at rest by default (Microsoft-managed key). Customer-managed keys (CMK) for compliance: key stored in Azure Key Vault, audited rotation. Encryption in transit via "Secure transfer required" (HTTPS only), paired with the account's minimum TLS version setting (TLS 1.2).

Concrete example

A media archive holds petabytes of video files with 7-year retention. Access pattern: heavy in first 30 days, sporadic for 90 days, near-zero after. Design: StorageV2 with GZRS redundancy (survives region outage). Lifecycle policy: Hot → Cool at 30 days → Cold at 90 → Archive at 180 → delete at 7 years. Time-based immutability (locked 7y) on the legal-hold container. Access via private endpoints from the video-processing VNet; public endpoint disabled. Application identities use RBAC on data plane; external auditors get scoped user-delegation SAS tokens, 1-hour expiry.
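
The lifecycle rule in this design can be written out as the JSON document a storage account would hold, here as a Python dictionary with a small helper that predicts where a blob of a given age should sit. The structure follows the lifecycle management policy schema as I understand it; the rule name is hypothetical, and 2,555 days is simply 7 × 365.

```python
lifecycle_policy = {
    "rules": [
        {
            "enabled": True,
            "name": "media-archive-tiering",  # hypothetical rule name
            "type": "Lifecycle",
            "definition": {
                "filters": {"blobTypes": ["blockBlob"]},
                "actions": {
                    "baseBlob": {
                        "tierToCool": {"daysAfterModificationGreaterThan": 30},
                        "tierToCold": {"daysAfterModificationGreaterThan": 90},
                        "tierToArchive": {"daysAfterModificationGreaterThan": 180},
                        "delete": {"daysAfterModificationGreaterThan": 2555},
                    }
                },
            },
        }
    ]
}


def expected_tier(age_days: int) -> str:
    """Predict the tier a blob should have reached at a given age (toy model)."""
    actions = lifecycle_policy["rules"][0]["definition"]["actions"]["baseBlob"]
    tier = "Hot"
    steps = [("tierToCool", "Cool"), ("tierToCold", "Cold"),
             ("tierToArchive", "Archive"), ("delete", "Deleted")]
    for action, label in steps:
        if age_days > actions[action]["daysAfterModificationGreaterThan"]:
            tier = label
    return tier


for age in (10, 31, 100, 200, 2556):
    print(age, expected_tier(age))
```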

Key takeaway: redundancy choice tracks RPO requirement. Lifecycle policies replace manual tiering. User-delegation SAS replaces account-key SAS for revocable, scoped access. Private endpoints replace service endpoints for new builds.

⚡ Mini-quiz

Drill redundancy + lifecycle + access scenarios → study mode (10 questions).

Module 04 (3 lessons)

Business Continuity & Disaster Recovery

Azure Site Recovery (ASR) for VMware-to-Azure replication: RPO as low as 5 minutes, automated recovery plans; Azure SQL Business Critical tier with Auto-Failover Groups: cross-region automatic failover with DNS listener abstraction; VMSS deployment across 3 Availability Zones with zone-redundant Load Balancer; Azure Backup Center for multi-subscription governance via Azure Policy; RTO/RPO tradeoff comparison: hot standby vs pilot light vs warm standby vs active-active.

Lesson 4.1 RTO, RPO, and the DR-pattern ladder

Every BC/DR question on AZ-305 hinges on RTO (how long can recovery take?) and RPO (how much data can you afford to lose?). The four DR patterns trade money for those two numbers. Pick the cheapest pattern that meets the requirement.

Key concepts

Backup & restore: cheapest, highest RTO (hours-days), RPO bounded by backup frequency. Suitable for non-critical workloads or data with low business value.
Pilot light: minimal infrastructure kept warm in DR region (DB replicating, network configured, no app servers). RTO measured in hours (start app servers); RPO close to zero for the DB.
Warm standby: scaled-down version of full prod running in DR region. RTO minutes (scale up); RPO close to zero. The "middle option" most exam scenarios expect.
Active-active (multi-region): full prod in both regions, traffic balanced across both. RTO < 1 minute (just stop routing to failed region); RPO close to zero. Most expensive. Reserve it for global services whose RTO must be near zero.
Tier-by-tier RTO inheritance: the whole stack's RTO is the SLOWEST tier's RTO. If the DB takes 2 hours to recover, the 5-minute RTO web tier doesn't matter. Design for the slowest dependency.
Geo-redundancy at storage layer: ZRS / GRS / GZRS choices flow from the DR pattern. Active-active requires ZRS or GZRS in BOTH regions. Pilot light can use GRS with read access for the DB tier.

Concrete example

A SaaS platform has RTO 30 minutes / RPO 5 minutes for the customer-facing API. Choice: warm standby. Primary region: an App Service Environment (ASE) running 20 instances. DR region: an identically configured ASE running 4 instances, both behind Traffic Manager priority routing. Azure SQL with auto-failover group; replication lag < 5 seconds = RPO satisfied. On region outage: Traffic Manager flips to DR endpoint within 30 seconds (DNS TTL), DR app service plan scales out from 4 → 20 in < 15 minutes. Total RTO < 30 minutes.
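
"Pick the cheapest pattern that meets the requirement" and "the stack inherits the slowest tier's RTO" are both simple enough to express as code. In the sketch below the achievable-RTO figures per pattern are illustrative assumptions chosen to match the rough ranges in this lesson, not Azure guarantees.

```python
from typing import Optional

# (pattern, assumed achievable RTO in minutes), ordered cheapest first.
PATTERNS = [
    ("backup-and-restore", 24 * 60),
    ("pilot-light", 4 * 60),
    ("warm-standby", 30),
    ("active-active", 1),
]


def cheapest_pattern(required_rto_min: float) -> Optional[str]:
    """Walk from cheapest to most expensive; stop at the first that is fast enough."""
    for name, achievable in PATTERNS:
        if achievable <= required_rto_min:
            return name
    return None  # no pattern in this table meets the requirement


def stack_rto(tier_rto_min: dict) -> float:
    """The whole stack recovers only as fast as its slowest tier."""
    return max(tier_rto_min.values())


print(cheapest_pattern(30))                 # the SaaS API in the example
print(cheapest_pattern(600))
print(stack_rto({"web": 5, "app": 15, "sql": 120}))
```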

Key takeaway: RTO/RPO requirements pick the pattern; pattern picks the infrastructure. Inherit RTO from the slowest tier. Active-active only when minutes of RTO matter; otherwise warm standby covers most enterprise needs.

⚡ Mini-quiz

Drill DR-pattern decisions → study mode (10 questions).

Lesson 4.2 Azure Site Recovery and VM replication

ASR is the AZ-305 answer for VM-level DR. It replicates VMs continuously to a secondary region and orchestrates failover via recovery plans. The exam tests the replication topology, recovery plan ordering, and where ASR fits vs Backup.

Key concepts

Continuous replication: ASR streams disk changes to replica managed disks (Premium / Standard) in the secondary region, coordinated through a Recovery Services Vault. RPO is generally < 5 minutes for most VM SKUs. The secondary VMs themselves are created at failover, not running before.
Recovery plans: ordered group of VMs (and optional Azure Automation runbooks) that defines failover sequence — start DB tier first, wait for health, then app tier, then web tier. Plans support pre-, post-, and intra-step scripts.
Failover modes: Test failover (isolated VNet, doesn't disrupt primary; quarterly DR drill should always use this), Planned failover (cooperative — primary shuts down cleanly, no data loss), Unplanned failover (primary unreachable, accept the last-replicated state).
Failback: after primary is healthy, ASR can re-replicate from secondary back to primary, then planned-failover back. Validate failback quarterly — many teams test failover but never failback and discover the gap during a real incident.
ASR vs Backup: ASR is "the region is gone, give me a running system". Backup is "I deleted the data, give me a point-in-time copy". They share the Recovery Services Vault but solve different problems — most enterprises run both.
Hyper-V / VMware to Azure: ASR can replicate on-prem to Azure too, using a Configuration Server / Process Server appliance on-prem. The exam tests this less than Azure-to-Azure but it does appear.

Concrete example

A 3-tier app (web / app / SQL on IaaS VMs) needs RTO 60 minutes / RPO 15 minutes cross-region. Solution: ASR replication for all VMs from eastus2 to westus2. Recovery plan: step 1 — domain controllers and SQL VMs; step 2 — app tier (after health check via post-script); step 3 — web tier (after health check); step 4 — Azure Automation runbook to update Traffic Manager weights. Test failover runs quarterly in an isolated VNet for the DR drill, with no impact on production.
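
The ordering logic of a recovery plan can be modelled as a list of groups that run in sequence, each gated on a health check. This is a simulation of the idea only: the VM names, the runbook label, and the `is_healthy` callback are hypothetical, and a real plan is defined in the Recovery Services Vault rather than in application code.

```python
recovery_plan = [
    ("group 1", ["dc01", "sql01"]),
    ("group 2", ["app01", "app02"]),
    ("group 3", ["web01", "web02"]),
    ("group 4", ["runbook:update-traffic-manager"]),
]


def run_plan(plan: list, is_healthy) -> tuple:
    """Start each group in order; halt if any member of a group is unhealthy."""
    started = []
    for group, members in plan:
        started.extend(members)
        if not all(is_healthy(m) for m in members):
            return started, "halted after " + group
    return started, "completed"


# Happy path: every step reports healthy.
print(run_plan(recovery_plan, lambda member: True))

# SQL fails its health check, so the app and web tiers never start.
print(run_plan(recovery_plan, lambda member: member != "sql01"))
```

Putting the Traffic Manager runbook last means traffic is only redirected once every tier below it has passed its check.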

Key takeaway: ASR for VM-level cross-region replication. Recovery plans express the tier dependencies. Test both failover and failback quarterly. ASR and Backup are complementary, not alternatives.

⚡ Mini-quiz

Practise ASR + recovery plan scenarios → quick quiz (5 questions).

Lesson 4.3 Database BC/DR — Azure SQL, Cosmos DB, and managed alternatives

Databases have richer BC/DR primitives than VMs. AZ-305 tests Azure SQL auto-failover groups (regional failover for SQL Database / Managed Instance) and Cosmos DB multi-region writes. Picking the right managed feature beats hand-rolled replication every time.

Key concepts

Azure SQL auto-failover groups: bundle one primary and one secondary Azure SQL DB / Managed Instance in different regions with a single listener DNS endpoint. App connects to the listener; failover (planned or unplanned) flips DNS — clients reconnect transparently.
Failover-group failure modes: automatic (Microsoft initiates after grace period when primary unreachable) or manual. Read-Write listener follows primary; Read-Only listener can route to secondary for read replicas.
Service tier and replication SLA: Business Critical tier offers in-region zone redundancy + cross-region failover groups with RPO < 5 seconds. General Purpose has lower SLA. Hyperscale uses page server replicas — different failover story.
Cosmos DB multi-region writes: enable to make every region a writeable replica. Conflict resolution: last writer wins (default), custom merge procedure (you write JavaScript), or manual via conflicts feed. Active-active across regions without hand-built replication, though every added write region increases RU cost.
Cosmos DB consistency levels: strong, bounded staleness, session, consistent prefix, eventual. Strong forbids multi-region writes. Session is the default and the usual pairing with multi-region writes: it guarantees read-your-own-writes within a session.
Storage redundancy revisited: for the DB's storage substrate, GZRS / RA-GZRS are the only options that survive an in-region zone outage AND cross-region. Pair with auto-failover groups for a complete picture.

Concrete example

A global SaaS needs < 5-minute RTO and active-active writes. Choice for the OLTP tier: Cosmos DB with multi-region writes enabled in three regions, session consistency, last-writer-wins conflict resolution. Choice for the reporting DB (already running on Azure SQL): Azure SQL Business Critical with auto-failover group primary in westeurope, secondary in northeurope, automatic failover, RPO < 5 seconds. App's connection strings use the failover-group listener for SQL and the global account endpoint for Cosmos.
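
Last-writer-wins is the piece of this design that most often surprises people, so a tiny sketch helps. It assumes the default behaviour of comparing the system timestamp property `_ts` and keeping the highest value; the documents and the function are illustrative, not SDK code.

```python
def last_writer_wins(versions: list, path: str = "_ts") -> dict:
    """Keep the version with the highest value at the conflict-resolution path."""
    return max(versions, key=lambda doc: doc[path])


# Two regions updated the same item before replication caught up.
from_west_europe = {"id": "order-42", "status": "shipped", "_ts": 1700000100}
from_east_us = {"id": "order-42", "status": "cancelled", "_ts": 1700000160}

winner = last_writer_wins([from_west_europe, from_east_us])
print(winner["status"])
```

The earlier write is discarded silently, which is acceptable for idempotent status updates but not for counters or balances; those are the cases where a custom merge procedure or the conflicts feed earns its complexity.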

Key takeaway: Azure SQL auto-failover groups for managed SQL. Cosmos DB multi-region writes for true active-active OLTP. Pick the service's native BC/DR feature; rolling your own with replication and Traffic Manager is almost always wrong on the exam.

⚡ Mini-quiz

Drill DB BC/DR scenarios → study mode (10 questions).

Module 05 (3 lessons)

Network & Connectivity Design

Virtual WAN Secured Hub with Azure Firewall for intent-based routing across all connected spokes; ExpressRoute for SLA-backed private connectivity vs Site-to-Site VPN; Traffic Manager routing methods (Performance, Geographic, Weighted) with endpoint health probes; Application Gateway end-to-end SSL with backend HTTPS settings; NSG outbound rules with service tags + VNet Service Endpoints; Private Endpoints for Key Vault, Storage, and SQL with Azure Private DNS Zone integration.

Lesson 5.1 Hub-and-spoke and Virtual WAN topologies

Azure networking design at scale almost always lands on hub-and-spoke or Virtual WAN. Pick the wrong one and your transit cost explodes or you can't route between regions. AZ-305 tests the topology decision and the configuration flags that make peering / gateway-transit actually work.

Key concepts

Hub-and-spoke (traditional): one hub VNet per region with shared services (firewall, gateway, DNS), spokes peer to the hub. Build it yourself with VNet peering. Cost: peering traffic is billed per GB, inbound and outbound.
Virtual WAN: Microsoft-managed hub-and-spoke at global scale. Single Virtual WAN resource holds multiple hubs (one per region) and orchestrates peering between them. Supports VPN, ExpressRoute, and Secure Hub (built-in Firewall + Routing Intent).
Gateway transit flag pair: on the hub-side peering, enable Allow gateway transit; on the spoke-side, enable Use remote gateways. Now the spoke uses the hub's VPN/ExpressRoute Gateway. Without it, every spoke needs its own gateway.
Peering non-transitivity: A↔B and B↔C does NOT give A↔C. To allow spoke-to-spoke traffic in hub-and-spoke, deploy an NVA or Azure Firewall in the hub and add UDRs in each spoke. Virtual WAN's Secure Hub does this natively via Routing Intent.
Address space planning: assign a region-wide supernet (e.g., 10.10.0.0/16 for eastus2). Carve spokes from that. Reserve 10.255.0.0/16 (or similar) for the hub's GatewaySubnet / AzureFirewallSubnet / AzureBastionSubnet — those need specific names AND minimum sizes.
When to pick which: < 10 spokes in one region → hand-rolled hub-and-spoke is fine. Multi-region or > 20 spokes → Virtual WAN. Need to integrate SD-WAN appliances → Virtual WAN. Tight cost optimisation in a small footprint → hand-rolled.
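
The address-planning bullet can be checked mechanically before anything is deployed. The sketch below uses Python's standard ipaddress module with the two ranges from the text; the /24 spoke size and the hub subnet layout are assumptions (the sizes shown are commonly recommended values, so confirm them against current Azure documentation).

```python
import ipaddress

region_supernet = ipaddress.ip_network("10.10.0.0/16")  # eastus2 supernet from the text
hub_range = ipaddress.ip_network("10.255.0.0/16")       # reserved hub range from the text

# Carve /24 spokes out of the regional supernet; keep the first three for illustration.
spokes = list(region_supernet.subnets(new_prefix=24))[:3]

# Hypothetical hub layout. The subnet names are the ones Azure requires.
hub_subnets = {
    "GatewaySubnet": ipaddress.ip_network("10.255.0.0/27"),
    "AzureFirewallSubnet": ipaddress.ip_network("10.255.1.0/26"),
    "AzureBastionSubnet": ipaddress.ip_network("10.255.2.0/26"),
}


def overlaps_any(candidate, others):
    """True if candidate shares any address with a network in others."""
    return any(candidate.overlaps(other) for other in others)


for name, subnet in hub_subnets.items():
    inside_hub = subnet.subnet_of(hub_range)
    clashes = overlaps_any(subnet, spokes)
    print(f"{name}: {subnet} inside hub range={inside_hub}, overlaps a spoke={clashes}")

print("spokes:", [str(s) for s in spokes])
```

Running a check like this in a pipeline catches overlapping ranges early, which matters because peered VNets cannot have overlapping address space.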

Concrete example

A multinational retailer with three regions and 60 spoke VNets total. Choice: Virtual WAN with one hub per region. Each hub is a Secure Hub running Azure Firewall with Routing Intent configured for private + internet inspection. Spokes peer to their regional hub; cross-region traffic flows through the Microsoft backbone between hubs at no per-peering cost. ExpressRoute circuits terminate at one hub per region; failover via Global Reach.

Key takeaway: Virtual WAN for > 1 region or > 20 spokes; hand-rolled hub-and-spoke for smaller scope. Always use Routing Intent / Azure Firewall to route between spokes (peering is non-transitive). Plan address space at region-supernet scope.

⚡ Mini-quiz

Drill hub-and-spoke vs Virtual WAN decisions → study mode (10 questions).

Lesson 5.2 Hybrid connectivity — VPN, ExpressRoute, and SD-WAN

Connecting on-prem to Azure has three primary options: VPN over the public internet, ExpressRoute private circuits, and SD-WAN integration. AZ-305 tests when to pick each based on bandwidth, SLA, latency, and cost.

Key concepts

Site-to-Site VPN: IPsec tunnel over the internet. SKUs VpnGw1 through VpnGw5 scale bandwidth and concurrent tunnels. Active-active deployment on zone-redundant (AZ) SKUs hardens the gateway, which carries a 99.95% SLA. Cheap (~$140/mo for VpnGw1) but no bandwidth guarantee.
ExpressRoute: private circuit through a connectivity provider — bypasses the public internet, comes with an SLA (99.9 or 99.95% with redundant circuits). Two peering types: private (Azure VNets), Microsoft (Microsoft 365 / public Azure PaaS over the private circuit).
ExpressRoute Global Reach: connects two ExpressRoute circuits to each other through the Microsoft backbone. Useful for branch-to-branch traffic without backhauling through HQ.
ExpressRoute FastPath: sends data-path traffic from the circuit directly to VMs, bypassing the ExpressRoute gateway (the gateway is still required for route exchange). Higher throughput; requires an Ultra Performance or ErGw3AZ gateway. Needed for >10 Gbps real-world throughput.
SD-WAN integration: Virtual WAN supports SD-WAN partner appliances (Cisco SD-WAN, Aruba, Versa, etc.) as VNet-deployed CPE. The branch office connects to its SD-WAN appliance, which terminates into the regional Virtual WAN hub. Great for retail / 100+ branches.
Cost vs criticality: VPN for dev/test and small businesses; ExpressRoute for production with bandwidth or SLA requirements; ExpressRoute + VPN failover when ExpressRoute downtime would cost more than the VPN keep-alive.
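
The "ExpressRoute + VPN failover" trade-off is easier to reason about with numbers. The arithmetic below is a rough sketch: it treats the 99.95% figures quoted above as availabilities and assumes the two paths fail independently, which real circuits and internet paths only approximate.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def combined_availability(*paths: float) -> float:
    """Availability of redundant paths, assuming independent failures."""
    unavailable = 1.0
    for availability in paths:
        unavailable *= 1.0 - availability
    return 1.0 - unavailable


def downtime_minutes(availability: float) -> float:
    """Expected minutes of downtime per year at the given availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR


expressroute = 0.9995  # figure quoted above for redundant circuits
vpn_gateway = 0.9995   # figure quoted above for the VPN gateway
both = combined_availability(expressroute, vpn_gateway)

print(f"ExpressRoute only: {downtime_minutes(expressroute):.1f} min/year")
print(f"ExpressRoute + VPN failover: {downtime_minutes(both):.2f} min/year")
```

Under these assumptions the backup tunnel cuts expected downtime from hours per year to well under a minute, which is the figure to weigh against the monthly gateway cost.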

Concrete example

A bank needs to connect HQ (200 Mbps, sustained) and 40 branch offices to Azure with SLA-backed connectivity. Design: ExpressRoute circuit at HQ (500 Mbps, dual circuits in different ER providers for 99.95% SLA), terminated at the Virtual WAN hub. Branches connect via SD-WAN appliances (Cisco vEdge in Virtual WAN partner pattern) that route into the same hub. Backup VPN tunnels from HQ to Azure as failover for the unlikely dual-ER outage.

Key takeaway: VPN for cheap / low-stakes. ExpressRoute for production with SLA. Global Reach for branch-to-branch. SD-WAN integration when you have many branches. ExpressRoute + VPN failover when SLA matters more than cost.

⚡ Mini-quiz

Practise hybrid connectivity scenarios → quick quiz (5 questions).

Lesson 5.3 Private endpoints, DNS, and global load balancing

The last network design lesson covers the three pieces every modern Azure architecture needs: private endpoints (no public IPs), Azure DNS Private Resolver (hybrid name resolution), and global load balancing (Front Door vs Traffic Manager).

Key concepts

Private endpoints: a NIC in your VNet with a private IP that maps to a specific PaaS resource (Storage, SQL, Key Vault, App Service, etc.). Disable public network access on the resource once private endpoints are in place. The modern AZ-305 default for any PaaS-to-VNet integration.
Azure Private DNS zones: required for name resolution to the private endpoint's IP. Each PaaS service has a documented Private DNS zone name (privatelink.blob.core.windows.net, privatelink.database.windows.net). Link the zone to every VNet that needs to resolve.
Azure DNS Private Resolver: the managed bridge for hybrid DNS. Inbound endpoints answer queries from on-prem; outbound endpoints forward configured zones to on-prem resolvers. Replaces the old DNS-forwarder-VM pattern.
Service endpoints vs private endpoints: service endpoints route traffic to PaaS over the Microsoft backbone but keep the public IP. Private endpoints make the PaaS endpoint private. New designs almost always pick private endpoints.
Traffic Manager (DNS): global load balancer based on DNS resolution. Methods: Performance (lowest latency), Geographic (per source country), Weighted, Priority (active/passive). Failover bounded by client DNS TTL — typically 30s+. Use when sub-second failover isn't required.
Azure Front Door (anycast): global L7 reverse proxy at Microsoft edge POPs. WAF, CDN caching, sub-second failover, URL-based routing. Premium tier adds private origin support (your origins don't need public IPs). The modern default for global L7.
Decision pattern: Regional internal/external → Load Balancer / Application Gateway. Global L7 + WAF + edge caching → Front Door. Global DNS-based without WAF/caching → Traffic Manager. Static assets only → Front Door / CDN.
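
The decision pattern above can be written as a small lookup function. This is a study aid rather than an official decision tree: the function name and parameters are hypothetical, and the regional L4/L7 split (Load Balancer for layer 4, Application Gateway for layer 7) is added here to separate the two regional services.

```python
def pick_load_balancer(scope: str, layer: str, needs_waf_or_caching: bool) -> str:
    """Map a requirement to a service, following the decision pattern above."""
    if scope not in ("regional", "global") or layer not in ("L4", "L7"):
        raise ValueError("scope must be regional/global and layer must be L4/L7")
    if scope == "regional":
        # Regional traffic: Application Gateway for HTTP(S), Load Balancer otherwise.
        return "Application Gateway" if layer == "L7" else "Load Balancer"
    if layer == "L7" and needs_waf_or_caching:
        # Global L7 with WAF or edge caching.
        return "Front Door"
    # Global, DNS-based, no WAF or caching requirement.
    return "Traffic Manager"


print(pick_load_balancer("global", "L7", True))
print(pick_load_balancer("global", "L7", False))
print(pick_load_balancer("regional", "L4", False))
```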

Concrete example

A global SaaS needs (1) WAF and edge caching at <100ms RTT globally, (2) all PaaS services accessed privately from the app VNet, (3) on-prem clients can resolve Azure Private DNS records. Design: Azure Front Door Premium in front, two regional origin groups, WAF in Prevention mode, caching /static/*. App VNet uses private endpoints for Storage, SQL, Key Vault — all public endpoints disabled. Azure DNS Private Resolver with inbound endpoint exposes the Private DNS zones to on-prem resolvers via DNS conditional forwarders.

Key takeaway: private endpoints + Private DNS zones for all new PaaS integrations. Front Door for global L7 (modern default); Traffic Manager only when DNS-only failover suffices. DNS Private Resolver bridges Azure Private DNS to on-prem.

⚡ Mini-quiz

Drill private endpoint + global LB scenarios → study mode (10 questions).

Module 06 (3 lessons)

Compute & Integration Services

Azure Batch for HPC scale-to-zero workloads with spot VM pricing; AKS with KEDA + HPA for mixed scaling microservices; Container Apps for serverless containers with sidecar support and scale-to-zero; Azure Migrate Discovery & Assessment appliance for right-sizing and dependency mapping; Azure Database Migration Service (DMS) online mode for minimal-downtime SQL migrations; API Management (APIM) for rate limiting, JWT validation, developer portal, and product subscriptions; Service Bus message sessions for exactly-once ordered financial transaction processing.

Lesson 6.1 Choosing compute — VMs, App Service, Containers, Functions

Azure compute is a ladder of abstractions from raw VMs up to serverless functions. AZ-305 asks you to climb the ladder — pick the highest-abstraction option that satisfies the requirement, because each step up cuts operational toil.

Key concepts

Azure VMs: full control + responsibility for OS patching. Use only when you need OS-level access, specific kernels, GPUs not in PaaS, or licensing tied to physical CPU sockets.
App Service: managed web hosting. Standard / Premium / Isolated tiers. Slot swap, autoscale, custom domains, managed identities. Right for stateless web apps, APIs, background jobs (WebJobs).
Azure Container Instances (ACI): single-container serverless runs. Pay per second. Right for batch jobs, ephemeral tasks, virtual-kubelet bursts from AKS.
Container Apps: managed Kubernetes-shaped serverless containers. KEDA-based scaling (including scale-to-zero), Dapr sidecars, revisions for blue/green. The modern "I have a container but don't want full AKS" answer.
AKS: managed Kubernetes. The control plane is free on the Free tier (the Standard tier adds a paid uptime SLA), so you mainly pay for worker nodes. Right when you need K8s primitives, complex orchestration, helm ecosystem, or third-party operators.
Functions: event-driven serverless. Consumption (true serverless), Premium (warm), App Service plan (predictable cost). Right for trigger-driven workflows, glue code, event handlers.
Azure Batch: managed batch + HPC. Scale-to-zero, Spot VM support, low-priority pools. Right for one-off massive parallel jobs (rendering, scientific compute).

Concrete example

A SaaS team has: (1) a long-running web tier (state in DB only), (2) a containerised image-processing service that scales 0 → 200 on bursts, (3) a CSV-to-DB ETL that runs nightly, (4) a research team's nightly molecular-dynamics simulation across hundreds of cores. Choices: (1) App Service Premium V3 with autoscale. (2) Container Apps with KEDA queue-triggered scale 0-200. (3) Function App on Consumption plan, timer trigger. (4) Azure Batch Spot pool, which scales to zero between runs.

Key takeaway: climb the ladder. Functions for triggers, Container Apps for scale-to-zero microservices, App Service for stateless web, AKS only when you need K8s, Batch for HPC, VMs only when forced.

⚡ Mini-quiz

Drill compute-selection scenarios → study mode (10 questions).

Lesson 6.2 API Management — façade, security, and developer experience

Azure API Management (APIM) sits in front of your APIs and adds rate limiting, transformation, auth, and a developer portal. AZ-305 tests when APIM is the right choice and which tier / topology fits.

Key concepts

Tier ladder: Consumption (pay-per-call serverless), Developer (non-prod single instance, no SLA), Basic and Standard (production, single region), Premium (multi-region, multi-AZ, VNet integration), Isolated (dedicated single-tenant, regulated industries).
Premium features for production: multi-region deployment, VNet integration with internal mode (no public IP), self-hosted gateway (run APIM gateway on-prem or in another cloud), Availability Zone deployment. Developer also offers VNet integration and the self-hosted gateway, but only for non-production use.
Policies: XML pipeline (inbound, backend, outbound, on-error). Rate limit, set-header, set-backend-service, validate-jwt, cache-lookup, mock-response, return-response. The composability is what makes APIM more than just a reverse proxy.
Products and subscriptions: a product is a curated bundle of APIs with a subscription key. Consumers subscribe to products via the developer portal. Per-product rate limits and approval workflows.
Backends: point to App Service, Function App, AKS, on-prem (via VNet), or third-party HTTP endpoints. APIM caches credentials and provides one consistent client-facing surface for many backends.
Self-hosted gateway: the APIM gateway as a container you run anywhere — on-prem, AWS, edge. Control plane in Azure; data plane wherever needed. Solves "API is on-prem but I want APIM features".
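
To make "per-product rate limits" concrete, here is a minimal fixed-window counter keyed by subscription key. It only illustrates the semantics of a calls-per-renewal-period limit and is not how the APIM gateway is implemented; the class name, the key values, and the explicit now_s argument are assumptions.

```python
from collections import defaultdict


class FixedWindowRateLimiter:
    """Allow at most `calls` requests per key in each renewal period."""

    def __init__(self, calls: int, renewal_period_s: int):
        self.calls = calls
        self.renewal_period_s = renewal_period_s
        self._counts = defaultdict(int)  # (key, window index) -> calls seen

    def allow(self, subscription_key: str, now_s: float) -> bool:
        window = int(now_s // self.renewal_period_s)
        bucket = (subscription_key, window)
        if self._counts[bucket] >= self.calls:
            return False  # a gateway would answer 429 Too Many Requests here
        self._counts[bucket] += 1
        return True


limiter = FixedWindowRateLimiter(calls=3, renewal_period_s=60)
partner_a = [limiter.allow("partner-a", t) for t in (0, 1, 2, 3, 61)]
partner_b = limiter.allow("partner-b", 3)
print(partner_a, partner_b)
```

The fourth call from partner-a inside the first minute is rejected, the call in the next window succeeds, and partner-b is counted separately because the limit is tracked per key.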

Concrete example

A bank exposes 40 internal APIs to external partners with: (1) per-partner rate limits, (2) JWT validation against partner IdPs, (3) audit logging, (4) regulatory requirement that one of the APIs runs in-DC behind a firewall. Choice: APIM Premium with multi-region deployment for SLA, VNet integration in internal mode. Each partner gets a Product wrapping selected APIs, with per-product rate-limit policies. Self-hosted gateway runs on-prem for the regulated API; same APIM management plane handles policy.

Key takeaway: APIM Premium for production multi-region with VNet integration. Self-hosted gateway extends the data plane to on-prem / other clouds. Products + subscriptions = monetisation / partner-tier model.

⚡ Mini-quiz

Practise APIM tier and policy scenarios → quick quiz (5 questions).

Lesson 6.3 Migration and integration — Azure Migrate, DMS, Service Bus

AZ-305 tests migration design more than tooling specifics. Know what each tool does, which one is the right fit, and the order to combine them in a phased migration.

Key concepts

Azure Migrate: the umbrella tool with Discovery & Assessment, Server Migration (powered by ASR), and Database Migration (powered by DMS). Always start here — the discovery appliance maps on-prem dependencies and right-sizes Azure VMs.
Database Migration Service (DMS): two modes — offline (cutover at the end, downtime equal to migration), online (continuous replication, near-zero downtime cutover via DNS flip). Online supports SQL Server → Azure SQL Database / Managed Instance, MySQL, PostgreSQL.
Server Migration: uses ASR-style replication. Same continuous-replication model, planned-failover semantics for the cutover. Right for VM-level migrations from VMware, Hyper-V, AWS, GCP.
Service Bus: enterprise message bus — queues, topics/subscriptions, sessions for ordered processing, dead-letter queues, scheduled messages, duplicate detection. Pick over Event Grid when you need queue-style competing-consumer semantics and ordering.
Event Grid: publish/subscribe for state-change events at massive scale. System topics from Azure services (blob created, VM started), custom topics from your apps, partner topics from SaaS. Push delivery, retries, dead-letter.
Decision pattern: Service Bus when you need ordering / sessions / queues / competing consumers / FIFO. Event Grid when you need pub/sub with massive fan-out. Both can coexist in a single architecture (Service Bus for transactional flows; Event Grid for state-change broadcasts).
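
Sessions and duplicate detection are easier to picture with a toy model. The sketch below is an in-memory simulation, not the Service Bus SDK: messages are hypothetical (message_id, session_id, body) tuples, duplicates are dropped by message ID, and each session is drained in arrival order.

```python
from collections import OrderedDict, deque


def group_by_session(messages):
    """Group messages per session, dropping any repeated message ID."""
    seen_ids = set()
    sessions = OrderedDict()
    for message_id, session_id, body in messages:
        if message_id in seen_ids:
            continue  # duplicate detection: same ID already accepted
        seen_ids.add(message_id)
        sessions.setdefault(session_id, deque()).append(body)
    return sessions


def drain(sessions):
    """Process each session front to back, as a single session receiver would."""
    processed = []
    for session_id, queue in sessions.items():
        while queue:
            processed.append((session_id, queue.popleft()))
    return processed


messages = [
    ("m1", "acct-1", "debit 100"),
    ("m2", "acct-2", "credit 50"),
    ("m3", "acct-1", "credit 30"),
    ("m1", "acct-1", "debit 100"),  # retry of m1, should be ignored
    ("m4", "acct-2", "debit 20"),
]
processed = drain(group_by_session(messages))
print(processed)
```

Ordering is guaranteed within a session, not across sessions, so the session ID should be the entity whose operations must stay in sequence (an account in this example).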

Concrete example

A bank migrates a 100-VM on-prem datacentre with an Oracle DB to Azure with near-zero downtime. Plan: in phase 1, the Azure Migrate Discovery appliance maps dependencies, right-sizes VMs, and identifies the Oracle DB as the integration bottleneck. In phase 2, Server Migration replicates VMs continuously and the app team validates in a test failover. In phase 3, DMS online mode replicates Oracle → Azure Database for PostgreSQL, with schema conversion via ora2pg (SSMA targets SQL Server and Azure SQL, not PostgreSQL). Cutover: short read-only window, DMS final sync, DNS flip to the Azure endpoints. Total cutover downtime: < 10 minutes.

Key takeaway: Azure Migrate first for discovery + right-sizing. DMS online for DB cutover with minimal downtime. Service Bus for ordered/queue messaging, Event Grid for pub/sub state changes; they coexist.

⚡ Mini-quiz

Drill migration + messaging scenarios → study mode (10 questions).

Module 07 (3 lessons)

Advanced Architecture Patterns

Managed Identity (system-assigned vs user-assigned) for zero-credential service-to-service auth; Azure Front Door + Cosmos DB multi-master for active-active global deployments; Azure Blueprints for subscription-level governance scaffolding with versioned artifacts; Data Box / Data Box Heavy for offline petabyte-scale data transfer; App Service HttpQueueLength autoscale for queue-based scaling beyond CPU; Event Grid for event-driven blob processing and pub/sub patterns; Azure Blueprints vs Terraform vs Azure Policy: governance tool selection criteria.

Lesson 7.1 Managed identities and zero-credential service auth

Hard-coded secrets are the easiest credential-theft path. Managed identities eliminate them — Azure issues short-lived tokens to your resource automatically. AZ-305 expects you to design every service-to-service auth around managed identities by default.

Key concepts

System-assigned MI: tied to the resource's lifecycle — created/deleted with the resource. Use when one identity per resource is the right scope. The default for most cases.
User-assigned MI: standalone resource you create, assign to multiple resources. Use when many resources share one role (e.g., a fleet of VMs all accessing the same Key Vault) or when you need to grant access BEFORE the resource exists.
Federated identity credentials: let an external workload (GitHub Actions, AWS, Kubernetes via OIDC) trade its identity for an Azure token without secrets. Right for CI/CD pipelines connecting to Azure.
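RBAC + Key Vault references: grant the MI a data-plane role on the target (Storage Blob Data Reader, Key Vault Secrets User). For Azure SQL, create a contained database user for the MI and add it to database roles; SQL DB Contributor is a management-plane role and grants no access to data. In App Service / Functions, use @Microsoft.KeyVault(SecretUri=...) in app settings; the runtime fetches the secret with the MI, with no code changes.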
Service Connector: newer abstraction that wires identity + connection for App Service / Container Apps / Functions → DB / Storage / Cache. Reduces the boilerplate of provisioning + permission + config string.
Audit trail: MI token requests are recorded in the Entra ID sign-in logs (managed identity sign-ins), and role assignments on the identity are recorded in the Activity Log. That makes MI usage forensic-ready by default, whereas a reused hard-coded secret cannot be attributed to a specific workload.
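
The Key Vault reference syntax mentioned above is just a formatted string in an app setting. The helper below builds it for illustration; the vault name, secret names, setting names, and the public-cloud vault.azure.net suffix are all assumptions.

```python
def key_vault_reference(vault_name: str, secret_name: str) -> str:
    """Build an app-setting value that points at a Key Vault secret."""
    secret_uri = f"https://{vault_name}.vault.azure.net/secrets/{secret_name}"
    return f"@Microsoft.KeyVault(SecretUri={secret_uri})"


def has_inline_password(settings: dict) -> bool:
    """Crude check for connection strings that still embed a password."""
    return any("password=" in value.lower() for value in settings.values())


# Hypothetical settings for an App Service that uses a managed identity.
app_settings = {
    "PaymentsApiKey": key_vault_reference("contoso-kv", "payments-api-key"),
    "StorageAccountName": "contosoblobs",
}

print(app_settings["PaymentsApiKey"])
print("inline password found:", has_inline_password(app_settings))
```

A check like has_inline_password can run in CI to flag settings that regress to embedded credentials.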

Concrete example

A web app on App Service needs to read secrets from Key Vault, write blobs to Storage, and connect to Azure SQL DB. Design: enable system-assigned managed identity on the App Service. Grant: Key Vault Secrets User on the vault, Storage Blob Data Contributor on the storage account, db_datareader + db_datawriter mapped to the MI on the SQL DB. App settings reference Key Vault secrets via @Microsoft.KeyVault URIs. No connection strings with passwords anywhere. Pipeline uses federated identity credentials from GitHub Actions OIDC for deploys.

Key takeaway: system-assigned for "one identity per resource", user-assigned for "one identity across many resources or pre-existing". Federated identity credentials for non-Azure CI/CD. Eliminate every connection string with a password.

⚡ Mini-quiz

Drill managed-identity scenarios → study mode (10 questions).

Lesson 7.2 IaC at scale — Bicep, ARM, Terraform, Deployment Stacks

AZ-305 covers IaC at the architectural level — which tool fits which org, how to gate deploys, how to handle drift. Picking the right tool is as much culture as technology.

Key concepts

ARM JSON: the legacy template format. Verbose, ugly, but the substrate everything else compiles to. Read-only in practice — you'll rarely hand-write ARM in 2026.
Bicep: Microsoft's DSL over ARM. Cleaner syntax, modules, conditional resources, loops. Transpiles to ARM. Free, first-class Azure support. The native AZ-305 IaC answer.
Terraform: HashiCorp's multi-cloud IaC. Same use cases as Bicep for Azure-only, plus consistent tooling if your org runs multi-cloud. Use when the org already has Terraform skills.
Deployment Stacks: a managed deployment with deny assignments — Azure prevents resources owned by the stack from being modified or deleted outside the stack. Replaces Azure Blueprints. Critical for governance scaffolding.
Template Specs: versioned Bicep/ARM templates stored as Azure resources. Share approved templates centrally; consumers deploy by reference. Used in landing-zone patterns to publish a "compliant network" template that BU teams reuse.
Drift detection: Bicep what-if and Terraform plan show the diff between declared state and actual. Azure Policy (with Audit effect) flags non-compliant configuration changes after deployment. Combine them: IaC for greenfield plus Policy for ongoing compliance.
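
Conceptually, drift detection is a diff between what the template declares and what the platform reports. The sketch below shows that idea on flat dictionaries; it is not how Bicep or Terraform compute their plans, and the property names and values are illustrative assumptions.

```python
ABSENT = "<absent>"


def diff_state(declared: dict, actual: dict) -> dict:
    """Return every property whose declared and actual values differ."""
    drift = {}
    for key in sorted(declared.keys() | actual.keys()):
        want = declared.get(key, ABSENT)
        have = actual.get(key, ABSENT)
        if want != have:
            drift[key] = {"declared": want, "actual": have}
    return drift


# Hypothetical storage account: someone re-enabled public access in the portal.
declared = {
    "sku": "Standard_GZRS",
    "publicNetworkAccess": "Disabled",
    "minimumTlsVersion": "TLS1_2",
}
actual = {
    "sku": "Standard_GZRS",
    "publicNetworkAccess": "Enabled",
    "minimumTlsVersion": "TLS1_2",
    "tags.owner": "alice",
}

drift = diff_state(declared, actual)
for prop, values in drift.items():
    print(prop, values)
```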

Concrete example

An enterprise builds a landing zone — repeatable subscription scaffolding with mandatory policies, hub VNet, monitoring workspace. Old way: Azure Blueprints. Modern way: Bicep modules published as Template Specs, deployed via Deployment Stacks with denySettings: denyDelete so BU teams can't accidentally delete the shared resources. Azure Policy initiative attached at the MG above the subscription enforces ongoing compliance (no public IPs without exemption, etc.). Drift detection via Policy audit + monthly Bicep diff reports.

Key takeaway: Bicep + Template Specs + Deployment Stacks is the modern AZ-305 governance stack. Terraform when the org has the skills and multi-cloud. Blueprints is legacy — call it out but use Deployment Stacks for new builds.

⚡ Mini-quiz

Practise IaC tooling selection → quick quiz (5 questions).

Lesson 7.3 Cost optimization, FinOps, and right-sizing

The final design domain on AZ-305 is FinOps — designing for cost is as architecturally important as designing for availability. The exam tests cost-management primitives, commitment-based discounts, and the trade-offs of cost-optimised architectures.

Key concepts

Reservations: 1- or 3-year commitments on specific SKUs (VMs, SQL DB, Cosmos DB, App Service, Synapse). 30-65% discount. Use for steady-state predictable workloads. Reservations can be exchanged for another SKU, subject to Microsoft's exchange policy.
Savings Plans (compute): 1- or 3-year commitment to a $/hour spend across compute (VMs, App Service, Container Instances, Functions Premium). More flexible than VM Reservations — covers SKU changes automatically.
Spot VMs: up to 90% discount, can be evicted with 30-second notice. Right for fault-tolerant batch / dev-test / stateless workloads. Combine with VMSS for graceful eviction handling.
Azure Hybrid Benefit: use existing Windows Server / SQL Server licenses on Azure to skip the per-OS cost. Major saving on lift-and-shift migrations.
Cost analysis + Budgets: Cost Management gives you slice-by-tag analytics and per-subscription budgets with threshold alerts. AZ-305 expects you to design "tag-everything" from day 1 so chargeback works.
FinOps tooling integration: export Cost Management data to Storage → ingest into Power BI or third-party (CloudHealth, Apptio). Combine actual cost with capacity / utilisation metrics to find under-used resources for right-sizing.
Architecture trade-offs: serverless (Consumption Functions, Container Apps scale-to-zero) trades latency for cost at low utilisation. PaaS (App Service) is a middle ground. IaaS (VMs) is cheapest at high steady utilisation if you've also bought reservations.

Concrete example

A 24/7 SaaS with predictable baseline + 4× peak. Cost design: baseline VMs covered by a 3-year Savings Plan for the steady tier (~50% discount). Peak handled by Spot VMSS instances (up to 90% discount, accepting that instances can be evicted). Windows licenses come from on-prem via Azure Hybrid Benefit. Cost Management budgets at the subscription level with 80%/100% alerts; tagging policy enforces CostCenter + Environment + Owner on every resource via Modify policy effect. Monthly Power BI report cross-references cost with VM Insights utilisation to flag right-sizing candidates.
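
A quick calculation shows how the pieces of this design add up. Every number below is an assumption chosen for illustration (hourly rate, VM count, peak hours); the 50% and 90% discounts are the figures from the example, with 90% being the upper bound for Spot.

```python
HOURS_PER_MONTH = 730


def pay_as_you_go_cost(rate, baseline_vms, peak_multiplier, peak_hours):
    """Monthly cost with no discounts: baseline runs all month, burst only at peak."""
    burst_vms = baseline_vms * (peak_multiplier - 1)
    return baseline_vms * rate * HOURS_PER_MONTH + burst_vms * rate * peak_hours


def optimised_cost(rate, baseline_vms, peak_multiplier, peak_hours,
                   savings_plan_discount=0.50, spot_discount=0.90):
    """Monthly cost with a Savings Plan on the baseline and Spot for the burst."""
    baseline = baseline_vms * rate * (1 - savings_plan_discount) * HOURS_PER_MONTH
    burst_vms = baseline_vms * (peak_multiplier - 1)
    burst = burst_vms * rate * (1 - spot_discount) * peak_hours
    return baseline + burst


# Hypothetical workload: 10 baseline VMs at $0.20/hour, 4x peak for 100 hours a month.
before = pay_as_you_go_cost(0.20, 10, 4, 100)
after = optimised_cost(0.20, 10, 4, 100)
print(f"pay-as-you-go: ${before:,.0f}  optimised: ${after:,.0f}")
```

With these inputs the optimised design costs roughly 790 a month against roughly 2,060 pay-as-you-go, and most of the saving comes from the commitment on the always-on baseline rather than from Spot.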

Key takeaway: Reservations / Savings Plans for steady baseline, Spot for fault-tolerant burst, Hybrid Benefit for license re-use. Tag from day 1 — chargeback is impossible without it. Right-size on data, not gut feel.

⚡ Mini-quiz

Drill cost-optimisation scenarios → study mode (10 questions).

Test your AZ-305 knowledge

60 scenario-based questions covering all 4 domains. No signup required.

Key AZ-305 concepts to master

Governance trap

Management Groups vs Azure Policy scope

Azure Policy assigned at the Management Group root cascades to ALL subscriptions underneath — including future subscriptions. Many candidates confuse Azure Policy (enforcement/compliance) with Azure Blueprints (deployment scaffolding) and Azure RBAC (access control). These are three distinct tools. A single Azure Policy at the root MG replaces the need to configure identical policies in each of 80+ subscriptions manually.

HA design trap

Availability Sets vs Availability Zones

Availability Sets protect against rack/hardware failure within a single datacenter, which is useful when a zone-redundant SKU isn't available. Availability Zones are physically separate datacenters with independent power and networking, so they protect against full datacenter failure. The AZ-305 exam frequently tests this distinction. For new greenfield deployments, Availability Zones (with zone-redundant Load Balancer + VMSS) are the preferred answer over Availability Sets wherever the region supports zones.

Data storage trap

Cosmos DB consistency levels & multi-master write

For active-active global deployments, Cosmos DB with multi-region write enabled allows writes to any region. The consistency level selection matters: Strong guarantees linearizability but incurs cross-region latency and is not available on accounts with multi-region writes. Bounded Staleness or Session is preferred for global apps. SQL Database Active Geo-Replication creates read-only secondaries, so you cannot write to secondary regions, which makes it unsuitable for true active-active patterns.

6-week study plan

Week 1

Governance & Identity foundations. Management Groups, Azure Policy effects (Deny/Audit/Modify), RBAC roles (Owner vs Contributor), PIM eligible assignments, resource locks. Do the 15 identity/governance practice questions, review wrong answers.

Week 2

Hybrid identity & monitoring. PTA vs PHS vs ADFS scenarios. Azure Monitor Agent, DCRs, Application Insights. Log Analytics retention tiers and cost optimization. Study dual-destination Diagnostic Settings pattern.

Week 3

Data storage design. Blob lifecycle tiering policies. Cosmos DB consistency + multi-master. SQL Serverless auto-pause. ADLS Gen2 vs Blob Storage. Azure Files Premium with AD auth. Do all storage practice questions.

Week 4

BCDR strategies. Azure Site Recovery (RPO/RTO numbers). SQL Auto-Failover Groups + Business Critical tier. VMSS across 3 AZs. Backup Center + Azure Policy. Memorize the RTO/RPO tier comparison table.

Week 5

Infrastructure & networking. Virtual WAN Secured Hub vs hub-and-spoke with UDRs. ExpressRoute vs VPN scenarios. Traffic Manager routing methods. App Gateway end-to-end SSL. Private Endpoints vs Service Endpoints. Complete full 60-question mock.

Week 6

Weak spot review & exam readiness. Re-do missed questions. Review compute choices (AKS vs Container Apps vs Functions vs VMs). Practice case study format. Focus on cost optimization answers. Take the full mock twice and aim for 85%+.

Top 4 reasons candidates fail AZ-305

Confusing governance tools: Azure Policy (enforce standards), Blueprints (subscription scaffolding), RBAC (access), and Locks (deletion prevention) all serve different purposes. The exam writes scenarios where the wrong tool looks plausible but doesn't meet the stated requirement.
Mixing up HA tiers: Availability Sets ≠ Availability Zones. Geo-Replication ≠ Auto-Failover Groups. Active-active ≠ active-passive. Memorize what each provides (datacenter, region-level) and its RTO/RPO characteristics.
Cost optimization blind spots: Overlooking Blob lifecycle policies, Log Analytics Basic plan, SQL Serverless auto-pause, and Azure Batch scale-to-zero. AZ-305 has a dedicated cost optimization thread woven through every domain.
Weak on Managed Identity: Many candidates default to connection strings and SAS tokens when Managed Identity is the zero-credential, least-privilege answer. The exam explicitly rewards this pattern for VM→Key Vault, ADF→ADLS Gen2, and AML→Storage scenarios.

AZ-305 vs AZ-104: What’s different?

AZ-104 (Azure Administrator) tests how to configure Azure services — deploying VMs, setting RBAC assignments, configuring storage. AZ-305 (Azure Solutions Architect Expert) tests how to design — which service combination best meets business requirements, cost constraints, and SLA targets.

The AZ-104 (Azure Administrator Associate) certification is required to earn the Azure Solutions Architect Expert certification, so most candidates pass AZ-104 before sitting AZ-305. The architect exam assumes you can implement; it focuses on justifying architectural decisions under constraints: budget, compliance, RTO/RPO targets, team skill sets, and existing investments. Case study questions test multi-service design holistically.

Related certifications