| Detail | Info |
|---|---|
| Exam code | AZ-305 |
| Full name | Designing Microsoft Azure Infrastructure Solutions |
| Questions | 40–60 questions (case studies, scenario-based, drag-and-drop) |
| Passing score | 700 / 1000 |
| Duration | 120 minutes |
| Cost | $165 USD |
| Prerequisite | Azure Administrator Associate (AZ-104), required to earn the certification |
| Renewal | Annual free online assessment |
AZ-305 is a design exam, so questions on governance test your topology skills more than your config. Management Group hierarchy is the canonical "draw the org chart" question that opens most exams. Pick the right shape on day one or unwind it later under pressure.
- Tenant Root Group: the implicit ceiling of every tenant. Policies and role assignments here apply to every subscription. Don't attach RBAC here unless you really mean "everyone, everywhere".
- Recommended pattern: Tenant Root → Platform / Landing Zones / Sandbox / Decommissioned (functional grouping under root), with subscriptions placed by lifecycle stage and workload type. Cloud Adoption Framework (CAF) ships this layout as a starting point.
- Subscription as billing AND blast-radius boundary: spending limits, quota limits, and policy scope all stop at the subscription. Greenfield landing zones almost always carve one subscription per business unit per environment (prod/non-prod) to bound blast radius.
- Move costs: moving subscriptions between management groups is cheap, but moving them between tenants is brutal (migration, not move). Get the tenant right at the start — multi-tenant designs are an AZ-305 anti-pattern unless legally required.
- Empty MGs as policy anchors: hold a future-state subscription scope by creating an empty MG that already has policy attached. New subscriptions are placed there and inherit the policy instantly, no retro work.
An enterprise with 40 business units wants central platform services (network, security, identity) separate from workload subscriptions, and a sandbox for experimentation that is policy-isolated. Design: under Tenant Root, four MGs: Platform (connectivity / management / identity subs), Landing Zones (split into Corp and Online sub-MGs), Sandbox, Decommissioned. Per-BU MGs sit under Landing Zones. Production-grade policies attach at Landing Zones; relaxed policies at Sandbox. The Decommissioned MG carries a Deny-all policy so that subscriptions parked there on their way out can't be used by accident.
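The inheritance behaviour in this design can be sketched in a few lines. This is a toy model, not an Azure SDK call: the `ManagementGroup` class and the policy labels (`production-baseline`, `relaxed-baseline`, `deny-all`) are hypothetical names chosen for illustration. It only shows that a scope inherits every assignment made above it.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ManagementGroup:
    """Toy model of a management group: a name, a parent, and attached policies."""
    name: str
    parent: Optional["ManagementGroup"] = None
    policies: list[str] = field(default_factory=list)


def effective_policies(group: ManagementGroup) -> list[str]:
    """Collect policies from the root down to `group` (assignments inherit downward)."""
    chain = []
    node: Optional[ManagementGroup] = group
    while node is not None:
        chain.append(node)
        node = node.parent
    # Reverse so the root's policies come first, then each child scope.
    return [p for mg in reversed(chain) for p in mg.policies]


# Hierarchy from the scenario; policy labels are illustrative placeholders.
root = ManagementGroup("Tenant Root")
platform = ManagementGroup("Platform", root)
landing_zones = ManagementGroup("Landing Zones", root, ["production-baseline"])
corp = ManagementGroup("Corp", landing_zones)
sandbox = ManagementGroup("Sandbox", root, ["relaxed-baseline"])
decommissioned = ManagementGroup("Decommissioned", root, ["deny-all"])

print(effective_policies(corp))     # ['production-baseline']
print(effective_policies(sandbox))  # ['relaxed-baseline']
```

Because Corp has no assignments of its own, everything it enforces comes from Landing Zones, which is why the attachment point matters more than the individual subscription.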
AZ-305 expects you to pick the right policy effect for the right control. Each effect has different ergonomics and different remediation behaviour. Pick wrong and you either fail to enforce (Audit when you needed Deny) or break legitimate workflows (Deny when you needed Audit).
- Effects: Deny (block create/update at API), Audit (log non-compliance, allow it), Append (add missing fields at create), Modify (add/remove/replace tags or properties, supports remediation), DeployIfNotExists (deploy missing resource — e.g., Log Analytics agent), AuditIfNotExists (log if a related resource is missing), Disabled.
- Initiatives (policy sets): bundle multiple policies and assign together. Built-in initiatives like Microsoft cloud security benchmark (formerly Azure Security Benchmark) ship hundreds of controls; you tweak parameters per assignment. Tracking compliance at initiative level is easier than per-policy.
- Assignment scope and exclusions: assign at MG/subscription/RG, exclude specific child scopes via `notScopes`. Exclusions are surgical: exclude a single resource that needs the exception without disabling the policy globally.
- Parameters: almost every built-in policy is parameterised. Same definition can be assigned with different allowed-locations lists per region or different required-tag values per BU.
- Remediation tasks: for Modify and DeployIfNotExists effects. Existing non-compliant resources need a remediation task to be brought into line. The task runs under the assignment's managed identity.
A regulated workload must (1) deploy only to eastus2 or westus2, (2) carry the CostCenter tag, (3) auto-deploy the Microsoft Defender for Cloud agent. Solution: an initiative containing the built-in Allowed locations policy (Deny), the Require a tag policy (Deny), and the Configure ASC monitoring agent policy (DeployIfNotExists with remediation managed identity). Assign at the regulated workload's MG. Run a remediation task to bring existing resources into compliance.
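To make the two Deny controls concrete, here is a minimal sketch. The `allowed_locations_rule` dictionary follows the `if` / `then` shape of an Azure Policy rule, but the parameter name `allowedLocations` is an assumption and may differ from the built-in definition. The `evaluate` function is a toy stand-in for the policy engine, written only to show what Deny does to a request.

```python
import json

# Policy rule in the Azure Policy definition shape: deny any resource whose
# location is outside the allowed list. The parameter name is an assumption.
allowed_locations_rule = {
    "if": {"not": {"field": "location", "in": "[parameters('allowedLocations')]"}},
    "then": {"effect": "deny"},
}

ALLOWED_LOCATIONS = ["eastus2", "westus2"]
REQUIRED_TAG = "CostCenter"


def evaluate(resource: dict) -> list[str]:
    """Toy evaluator: return the reasons a create/update request would be denied."""
    reasons = []
    if resource.get("location") not in ALLOWED_LOCATIONS:
        reasons.append("location not allowed")
    if REQUIRED_TAG not in resource.get("tags", {}):
        reasons.append(f"missing tag {REQUIRED_TAG}")
    return reasons


print(json.dumps(allowed_locations_rule, indent=2))
print(evaluate({"location": "eastus2", "tags": {"CostCenter": "1234"}}))  # []
print(evaluate({"location": "westeurope", "tags": {}}))
```

An empty list means the request passes both controls. With Audit instead of Deny, the same reasons would be logged and the request would still succeed.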
Privileged Identity Management (PIM) turns standing access into just-in-time elevation. Custom roles fill the gaps in built-in role coverage. Together they let you satisfy enterprise audit requirements without making people miserable.
- PIM eligibility vs active assignment: Eligible = "can request activation, gets the role for X hours". Active = "has the role right now". Eligible-only design eliminates standing high-privilege access.
- Approval workflow: activation requires approval from a configured set of approvers. Combine with MFA, justification, ticket number requirement. Audit log captures everything for compliance.
- Access reviews: recurring reviews where designated reviewers attest that listed users still need their roles. Auto-removes users who don't respond or whose reviewer denies. Required for ISO 27001 / SOC 2 evidence.
- Custom roles: JSON definition with `actions`, `notActions`, `dataActions`, `notDataActions`, and a list of `assignableScopes`. Use when no built-in matches, e.g., "restart VMs but never delete them", "read Key Vault secrets but never list them".
- Identity-design trade-offs: hybrid identity has three sync modes: Password Hash Sync (PHS) (simple, default), Pass-Through Auth (PTA) (auth happens on-prem), Federation (ADFS) (legacy, complex). AZ-305 favours PHS for most scenarios; PTA only when on-prem must validate every auth; federation only when a hard requirement (smart cards, etc.) exists.
- Conditional Access: AZ-500 territory but appears on AZ-305 in design questions. Build policies around user/group, location, app, sign-in risk, device state. Grant controls (require MFA, compliant device) and session controls (session lifetime, app enforced restrictions).
A regulated bank wants: (1) no standing access to subscription-level Owner roles, (2) MFA required for elevation, (3) quarterly access reviews on all Owner-eligible users. Design: assign Owner via PIM Eligible, set activation max 4 hours with MFA + approval + justification required. Configure quarterly access reviews on the Eligible assignments, with each user's manager as reviewer and "automatic removal if no response" turned on. Pair with a custom role SubscriptionViewer for the 90% of users who only need read access.
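The "restart VMs but never delete them" custom role mentioned above might look like the following sketch. The role name and the all-zero subscription ID are placeholders, and `is_allowed` is a simplified illustration of how `Actions` and `NotActions` combine, not the real authorization engine.

```python
from fnmatch import fnmatchcase

# Hypothetical custom role: operators can read, start, and restart VMs, never delete.
vm_restart_operator = {
    "Name": "VM Restart Operator",  # illustrative name
    "IsCustom": True,
    "Description": "Restart virtual machines without permission to delete them.",
    "Actions": [
        "Microsoft.Compute/virtualMachines/read",
        "Microsoft.Compute/virtualMachines/start/action",
        "Microsoft.Compute/virtualMachines/restart/action",
    ],
    "NotActions": [],
    "DataActions": [],
    "NotDataActions": [],
    # Placeholder subscription ID; replace with a real scope.
    "AssignableScopes": ["/subscriptions/00000000-0000-0000-0000-000000000000"],
}


def is_allowed(role: dict, operation: str) -> bool:
    """Control-plane check: granted by Actions and not subtracted by NotActions."""
    def matches(patterns: list[str]) -> bool:
        # Case-insensitive match; '*' in a pattern acts as a wildcard.
        return any(fnmatchcase(operation.lower(), p.lower()) for p in patterns)
    return matches(role["Actions"]) and not matches(role["NotActions"])


print(is_allowed(vm_restart_operator, "Microsoft.Compute/virtualMachines/restart/action"))
print(is_allowed(vm_restart_operator, "Microsoft.Compute/virtualMachines/delete"))
```

Listing only the needed operations keeps delete out without any `NotActions` entry. `NotActions` subtracts from a wildcard grant; it is not a deny assignment.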
Hybrid identity is the bridge between on-prem Active Directory and Microsoft Entra ID. Three sync modes have different security postures, operational costs, and failure characteristics. The exam asks you to map a compliance requirement to the right mode.
- Password Hash Sync (PHS): Entra Connect syncs a hash of the on-prem password hash to Entra ID. Users authenticate at Entra; on-prem doesn't need to be reachable. Simplest, default, supports Entra ID smart features (leaked-credential detection, Conditional Access). Most teams should pick this.
- Pass-Through Authentication (PTA): Entra Connect installs lightweight agents on-prem that validate passwords against on-prem AD on every sign-in. Password never leaves on-prem. Use when policy requires the password validation to happen on-prem. Requires multiple agents for HA.
- Federation (ADFS / third-party): Entra ID redirects auth to on-prem ADFS or a SAML IdP. Most operationally expensive (own ADFS farm) and most fragile. Used only when smart cards, third-party MFA, or specific token claims are non-negotiable.
- Seamless SSO: can be added to PHS or PTA. Kerberos-based, so users already signed into the on-prem domain get no additional prompt. Federation has SSO inherent.
- Entra Cloud Sync: newer lightweight alternative to Entra Connect for simple sync scenarios. Supports multi-forest from a single sync engine. Doesn't yet match Connect's full feature set; check current docs before committing.
- Disaster scenarios: PHS keeps auth working even when on-prem is offline. PTA fails closed. Federation fails closed AND requires complex Multi-Site / WAP farm DR. PHS is the cheapest "still works when AD is down" option.
A 5,000-user firm migrates to Microsoft 365 and Entra ID. Compliance dictates "password validation must happen on-prem". Security wants leaked-credential detection. Both are required. Choice: PTA + PHS together — PTA validates the actual sign-in; PHS sends hashes for leaked-credential detection (which compares to Microsoft's known-bad list, not as primary auth). Add Seamless SSO for domain-joined laptops. Federation explicitly avoided as overkill.
AZ-305's monitoring questions are about architecture, not config. Pick the right workspace topology, the right collection mechanism, and the right retention tier — get any wrong and you either lose evidence or pay ten times the necessary cost.
- Log Analytics workspace topology: one workspace per region per data-sovereignty boundary is the common pattern. Resist the "one global workspace" temptation — egress to another region costs more than per-workspace overhead.
- Azure Monitor Agent (AMA): the modern unified agent (replacing the legacy Log Analytics Agent / OMS Agent). Targets VMs / VMSS / Arc-enabled servers via association with Data Collection Rules.
- Data Collection Rules (DCR): declarative resource that specifies WHAT to collect (perf counters, syslog, custom logs) and WHERE to send it. Multiple DCRs can target the same VM via Data Collection Rule Associations (DCRA). Replaces per-workspace agent config of the old model.
- Retention tiers: Analytics (full-feature, queryable in the standard UI, expensive), with 30/60/90 days standard. Basic (limited query, cheaper, KQL `search` plus filters only), useful for long-tail debug data. Archive (long-term, restore-to-query), for compliance retention.
- Dual-destination pattern: send the same data to BOTH a Log Analytics workspace (for queryable hot data, 30-day retention) AND a Storage Account (long-term archive at 1/10th the cost). DCRs make this straightforward.
- Application Insights: APM + distributed tracing for apps. Workspace-based mode is the only supported option now — the AI data lives in your Log Analytics workspace. Used for end-to-end transaction tracing, dependency maps, and live-traffic exception streaming.
A global retailer with strict 7-year audit-log retention and a 90-day operational retention. Design: one Log Analytics workspace per region for operational data (90-day retention on Analytics tier). For audit logs: separate Diagnostic Setting destinations on each workload — Activity Log + key resources stream to BOTH the regional workspace AND a centralised Storage Account with immutability locked for 7 years. AMA + DCRs handle the VM-side collection; Application Insights covers the application tier with workspace-based mode.
Logs only matter if they fire alerts. AZ-305 tests your design judgment on which alert type for which condition, and on whether Sentinel belongs in the architecture.
- Alert rule types: Metric (numeric time series), Log query (KQL on Log Analytics), Activity log (control-plane events), Smart Detector (Application Insights anomalies). Pick by what your condition can be expressed as.
- Action groups: reusable notification bundles — email, SMS, voice call, push, webhook, Logic App, Azure Function, ITSM. One action group used by many alerts; updating the on-call rotation is one operation, not 50.
- Alert processing rules (formerly action rules): override notification behaviour by scope/filter, e.g. suppress all alerts during a maintenance window, change severity, route by tag.
- Workbooks and dashboards: Workbooks are interactive parameterised reports built on KQL. Dashboards aggregate metric tiles, log charts, and pinned visuals into a single view. Both deploy as ARM resources alongside the workload.
- Microsoft Sentinel: SIEM + SOAR built on top of Log Analytics. Adds analytics rules (KQL detections), incidents, automation playbooks (Logic Apps), and threat intelligence. Use when security ops needs queue-style incident management — not just "alert someone".
- Defender for Cloud vs Sentinel: Defender is the CSPM + workload-protection layer (vulnerabilities, misconfig, agent-based runtime protection). Sentinel is the SIEM. Most enterprises run both — Defender feeds findings into Sentinel as alerts.
A regulated workload needs: (1) page on-call when VM CPU > 90% for 10 minutes, (2) page security when a privileged-account sign-in happens from an unusual location, (3) SOC analyst workflow for security incidents. Design: (1) Metric alert on the VM Insights CPU metric → action group ag-oncall (PagerDuty webhook). (2) Log query alert on a SigninLogs KQL query → action group ag-security + auto-create incident in Microsoft Sentinel. (3) Sentinel analytics rule with the same logic + playbook to auto-disable the user via Logic App on critical risk. SOC works the Sentinel incidents queue.
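The first condition, "CPU > 90% for 10 minutes", can be read as a windowed aggregation. The sketch below assumes one CPU sample per minute and an average aggregation over the window. Both are assumptions for illustration, and `should_fire` is a hypothetical helper rather than anything Azure Monitor exposes.

```python
from statistics import mean


def should_fire(cpu_samples: list[float], threshold: float = 90.0, window: int = 10) -> bool:
    """Fire when the average of the last `window` one-minute samples exceeds `threshold`."""
    if len(cpu_samples) < window:
        return False  # not enough data to evaluate the full window
    return mean(cpu_samples[-window:]) > threshold


sustained = [95.0] * 10
brief_spike = [50.0] * 5 + [95.0] * 5

print(should_fire(sustained))    # True
print(should_fire(brief_spike))  # False
```

Averaging over the window is what stops a short spike from paging on-call. A rule that checks only the latest sample would be far noisier.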
AZ-305 doesn't ask "which one is fastest"; it asks "which one fits these constraints". The exam loves scenarios where you must match data shape, access pattern, latency, cost, and consistency to the right Azure data service. Memorise the decision tree.
- Relational, OLTP: Azure SQL Database for fully managed (DTU or vCore, Serverless option for dev/test). Azure SQL Managed Instance for migrating SQL Server with SQL Agent, cross-DB queries, near-100% SQL Server feature compatibility.
- NoSQL document / KV / graph: Cosmos DB across five APIs (Core SQL, MongoDB, Cassandra, Gremlin, Table). Globally distributed, multi-region writes, RU-based capacity, five consistency levels.
- Analytical / big data: ADLS Gen2 (hierarchical namespace on Blob — POSIX-like ACLs, ideal for Spark/Hadoop), Azure Synapse Analytics (data warehouse + Spark + Pipelines in one), Databricks (managed Spark).
- Object storage: Blob Storage for unstructured. Hot / Cool / Cold / Archive tiers. Use Blob lifecycle policies to age data through tiers automatically.
- File shares: Azure Files for SMB/NFS shared file systems — Premium tier for low-latency, AD DS integration for kerberos auth. Azure NetApp Files for high-IOPS / enterprise NAS workloads.
- Cache: Azure Cache for Redis — Premium tier supports VNet integration, clustering, geo-replication.
- Streaming + IoT: IoT Hub (millions of devices), Event Hubs (high-throughput ingestion), Stream Analytics (T-SQL on streams).
A retailer's data platform: (1) the catalog (1M products, looked up by SKU, <10ms target) → Cosmos DB Core SQL with session consistency. (2) The transaction store (orders, foreign keys, ACID) → Azure SQL Database Business Critical. (3) Cold order history (5+ years, compliance retention) → Blob Storage with lifecycle policy Hot → Cool at 30 days → Archive at 1 year. (4) Real-time clickstream → Event Hubs → Stream Analytics → Synapse for analytics. Five purpose-built services, none of them stretched outside their sweet spot.
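The decision tree above can be written down as a lookup, which is a handy way to rehearse it. This is a deliberately coarse sketch: the `shape` labels and the `needs_sql_agent` flag are illustrative inputs, and a real design would also weigh latency, cost, and consistency.

```python
def pick_data_store(shape: str, needs_sql_agent: bool = False) -> str:
    """Map a coarse data shape to the service family named in the list above."""
    if shape == "relational":
        # Managed Instance when SQL Server features such as SQL Agent are required.
        return "Azure SQL Managed Instance" if needs_sql_agent else "Azure SQL Database"
    choices = {
        "document": "Cosmos DB",
        "key-value": "Cosmos DB",
        "graph": "Cosmos DB",
        "analytical": "ADLS Gen2 + Azure Synapse Analytics",
        "object": "Blob Storage",
        "file": "Azure Files",
        "cache": "Azure Cache for Redis",
        "stream": "Event Hubs",
    }
    if shape not in choices:
        raise ValueError(f"unknown data shape: {shape}")
    return choices[shape]


print(pick_data_store("document"))                          # Cosmos DB
print(pick_data_store("relational", needs_sql_agent=True))  # Azure SQL Managed Instance
```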
Cosmos DB shows up in nearly every AZ-305 exam. The defining decisions are partition key, throughput model (provisioned RUs vs serverless vs autoscale), and consistency level. Get these wrong and you either overpay massively or hit throttling under load.
- Partition key: chosen at container creation, immutable. Drives physical partition placement. Pick a property with high cardinality and even access distribution — same rules as DynamoDB. Hot partitions cap your effective throughput at 10k RUs regardless of total provisioned.
- Request Units (RU): abstract throughput unit covering CPU + IOPS + memory. 1 KB point read = 1 RU; 1 KB write = ~5 RU. Provisioned throughput pays for reserved RU/s; serverless pays per million RU consumed; autoscale scales between min and max (paying for max consumed in any hour).
- Throughput modes: Provisioned for predictable steady traffic. Serverless for <5,000 RU bursts. Autoscale for variable traffic where peak <10× minimum. Switch between provisioned and autoscale freely; serverless is a separate account type.
- Consistency levels: Strong (linearisable, no multi-region writes), Bounded staleness (read lags by <K writes or T seconds), Session (read your own writes; default for SDK), Consistent prefix, Eventual. The stronger the consistency, the higher the latency you pay for it.
- Multi-region writes: every region is writeable. Conflict resolution by last-writer-wins (default), custom UDF, or conflicts feed for manual handling. Required for active-active geo deployments.
- Indexing policy: by default everything is indexed (expensive on writes). Customise to exclude properties you never query on — significant write-RU savings on write-heavy workloads.
A global IoT platform stores 10M device records, accessed by deviceId 99% of the time and by region 1% of the time. Design: container with partitionKey: /deviceId (10M distinct values — perfect cardinality, even distribution). Throughput: autoscale 4,000-40,000 RU/s (10× variance). Consistency: session (SDK default, sufficient for IoT use case). Indexing policy excludes the verbose telemetry payload property — saves ~30% on write RUs. Multi-region writes enabled in three regions for active-active. Region queries handled by a separate GSI-style approach with a synced read-only collection partitioned by region.
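The throughput numbers in this scenario follow from simple arithmetic. The sketch below uses the rules of thumb stated above (1 RU per 1 KB point read, roughly 5 RU per 1 KB write, 10k RU/s per physical partition). The linear scaling with item size and the request rates in the usage lines are assumptions for illustration only.

```python
import math

READ_RU_PER_KB = 1.0   # 1 KB point read, per the rule of thumb above
WRITE_RU_PER_KB = 5.0  # approximate cost of a 1 KB write


def estimate_ru_per_second(reads_per_s: float, writes_per_s: float, item_kb: float = 1.0) -> float:
    """Rough RU/s estimate; assumes cost scales linearly with item size (a simplification)."""
    return item_kb * (reads_per_s * READ_RU_PER_KB + writes_per_s * WRITE_RU_PER_KB)


def autoscale_range(max_ru: int) -> tuple[int, int]:
    """Autoscale scales between 10% of the configured maximum and the maximum."""
    return max_ru // 10, max_ru


def min_physical_partitions(total_ru: float, per_partition_cap: int = 10_000) -> int:
    """Lower bound on partitions needed so no single partition exceeds its RU cap."""
    return math.ceil(total_ru / per_partition_cap)


# Hypothetical peak load: 20,000 point reads and 4,000 writes per second of 1 KB items.
peak = estimate_ru_per_second(20_000, 4_000)
print(peak)                           # 40000.0
print(autoscale_range(40_000))        # (4000, 40000)
print(min_physical_partitions(peak))  # 4
```

The last figure is why partition-key choice matters: 40,000 RU/s is only reachable if requests spread across at least four physical partitions.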
The "storage detail" questions on AZ-305 are about durability options (LRS/ZRS/GRS/GZRS), automated lifecycle tiering, and access patterns (SAS / RBAC / private endpoints). Get the redundancy choice right based on RPO requirement; let lifecycle policies do the cost optimisation.
- Redundancy options: LRS (3 copies in one DC, cheapest), ZRS (3 copies across 3 AZs, survives AZ outage), GRS (LRS + async geo-replication to paired region), GZRS (ZRS + geo). Read-Access variants (RA-GRS, RA-GZRS) make the secondary readable.
- Storage account kinds: StorageV2 (GPv2, modern default), BlockBlobStorage (Premium SSD-backed, low-latency for blobs), FileStorage (Premium for Files). Skip GPv1 — it can't do access tiers.
- Blob lifecycle policies: JSON rules on a storage account that transition blobs between tiers and delete them based on age or last access. Daily evaluation, no extra cost. Standard pattern: Hot → Cool at 30d, Cool → Archive at 90d, delete at 7y.
- Immutable storage (WORM): time-based retention policy or legal hold. Once locked, even the owner can't delete until the policy expires. Required for SEC 17a-4 / FINRA compliance.
- Access patterns: SAS (account / service / user-delegation — prefer user-delegation, key-less and revocable), RBAC on data plane (Storage Blob Data Reader / Contributor — for service principals and managed identities), private endpoints (subnet-attached private IP — modern default for VNet-attached workloads), storage firewall (IP + VNet rules with default-deny).
- Encryption: at rest by default (Microsoft-managed key). Customer-managed keys (CMK) for compliance — key stored in Azure Key Vault, audited rotation. Encryption in transit via "Secure transfer required" (TLS 1.2 minimum).
A media archive holds petabytes of video files with 7-year retention. Access pattern: heavy in first 30 days, sporadic for 90 days, near-zero after. Design: StorageV2 with GZRS redundancy (survives region outage). Lifecycle policy: Hot → Cool at 30 days → Cold at 90 → Archive at 180 → delete at 7 years. Time-based immutability (locked 7y) on the legal-hold container. Access via private endpoints from the video-processing VNet; public endpoint disabled. Application identities use RBAC on data plane; external auditors get scoped user-delegation SAS tokens, 1-hour expiry.
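The lifecycle rule in this design could be expressed roughly as follows. The dictionary mirrors the JSON shape of a storage lifecycle management policy. The rule name is illustrative, and availability of the `tierToCold` action depends on the API version in use, so treat it as an assumption. `tier_for_age` is a toy helper showing where a blob ends up.

```python
DAYS_PER_YEAR = 365

# Lifecycle rule in the storage management-policy JSON shape.
lifecycle_policy = {
    "rules": [{
        "enabled": True,
        "name": "media-archive-tiering",  # illustrative rule name
        "type": "Lifecycle",
        "definition": {
            "filters": {"blobTypes": ["blockBlob"]},
            "actions": {"baseBlob": {
                "tierToCool": {"daysAfterModificationGreaterThan": 30},
                "tierToCold": {"daysAfterModificationGreaterThan": 90},
                "tierToArchive": {"daysAfterModificationGreaterThan": 180},
                "delete": {"daysAfterModificationGreaterThan": 7 * DAYS_PER_YEAR},
            }},
        },
    }]
}


def tier_for_age(days_since_modified: int) -> str:
    """Toy model of where a blob sits once the rule above has been applied."""
    actions = lifecycle_policy["rules"][0]["definition"]["actions"]["baseBlob"]

    def limit(action: str) -> int:
        return actions[action]["daysAfterModificationGreaterThan"]

    # Check the oldest threshold first so the most advanced stage wins.
    if days_since_modified > limit("delete"):
        return "Deleted"
    if days_since_modified > limit("tierToArchive"):
        return "Archive"
    if days_since_modified > limit("tierToCold"):
        return "Cold"
    if days_since_modified > limit("tierToCool"):
        return "Cool"
    return "Hot"


for age in (10, 45, 120, 400, 3000):
    print(age, tier_for_age(age))
```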
Every BC/DR question on AZ-305 hinges on RTO (how long can recovery take?) and RPO (how much data can you afford to lose?). The four DR patterns trade money for those two numbers. Pick the cheapest pattern that meets the requirement.
- Backup & restore: cheapest, highest RTO (hours-days), RPO bounded by backup frequency. Suitable for non-critical workloads or data with low business value.
- Pilot light: minimal infrastructure kept warm in DR region (DB replicating, network configured, no app servers). RTO measured in hours (start app servers); RPO close to zero for the DB.
- Warm standby: scaled-down version of full prod running in DR region. RTO minutes (scale up); RPO close to zero. The "middle option" most exam scenarios expect.
- Active-active (multi-region): full prod in both regions, traffic balanced across both. RTO < 1 minute (just stop routing to failed region); RPO close to zero. Most expensive. Required when a global service needs a near-zero RTO that warm standby's scale-up time can't meet.
- Tier-by-tier RTO inheritance: the whole stack's RTO is the SLOWEST tier's RTO. If the DB takes 2 hours to recover, the 5-minute RTO web tier doesn't matter. Design for the slowest dependency.
- Geo-redundancy at storage layer: ZRS / GRS / GZRS choices flow from the DR pattern. Active-active requires ZRS or GZRS in BOTH regions. Pilot light can use GRS with read access for the DB tier.
A SaaS platform has RTO 30 minutes / RPO 5 minutes for the customer-facing API. Choice: warm standby. Primary region: App Service Environment (ASE) running 20 instances. DR region: an identically configured ASE running 4 instances behind Traffic Manager priority routing. Azure SQL with auto-failover group; replication lag < 5 seconds = RPO satisfied. On region outage: Traffic Manager flips to DR endpoint within 30 seconds (DNS TTL), DR app service plan scales out from 4 → 20 in < 15 minutes. Total RTO < 30 minutes.
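The RTO reasoning here is plain arithmetic, sketched below with two hypothetical helpers. `stack_rto` encodes the slowest-tier rule from the list above, and `warm_standby_rto` adds up the sequential failover steps from this scenario.

```python
def stack_rto(tier_rto_minutes: dict[str, float]) -> float:
    """The stack is only recovered when its slowest tier is, so take the maximum."""
    return max(tier_rto_minutes.values())


def warm_standby_rto(dns_failover_min: float, scale_out_min: float) -> float:
    """Failover steps run one after another, so their durations add up."""
    return dns_failover_min + scale_out_min


# A 5-minute web tier does not help if the database needs 2 hours.
print(stack_rto({"web": 5, "db": 120}))  # 120

# Scenario figures: 30 seconds of DNS failover plus up to 15 minutes of scale-out.
total = warm_standby_rto(0.5, 15)
print(total, total <= 30)                # 15.5 True
```

The worst case of 15.5 minutes leaves roughly half of the 30-minute budget as headroom for detection and human decision time.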
ASR is the AZ-305 answer for VM-level DR. It replicates VMs continuously to a secondary region and orchestrates failover via recovery plans. The exam tests the replication topology, recovery plan ordering, and where ASR fits vs Backup.
- Continuous replication: ASR streams disk changes to replica managed disks (Premium / Standard) in the secondary region; the Recovery Services vault holds the replication configuration. RPO is generally < 5 minutes for most VM SKUs. The VMs themselves are created at failover and are not running before.
- Recovery plans: ordered group of VMs (and optional Azure Automation runbooks) that defines failover sequence — start DB tier first, wait for health, then app tier, then web tier. Plans support pre-, post-, and intra-step scripts.
- Failover modes: Test failover (isolated VNet, doesn't disrupt primary; quarterly DR drill should always use this), Planned failover (cooperative — primary shuts down cleanly, no data loss), Unplanned failover (primary unreachable, accept the last-replicated state).
- Failback: after primary is healthy, ASR can re-replicate from secondary back to primary, then planned-failover back. Validate failback quarterly — many teams test failover but never failback and discover the gap during a real incident.
- ASR vs Backup: ASR is "the region is gone, give me a running system". Backup is "I deleted the data, give me a point-in-time copy". They share the Recovery Services Vault but solve different problems — most enterprises run both.
- Hyper-V / VMware to Azure: ASR can replicate on-prem to Azure too, using a Configuration Server / Process Server appliance on-prem. The exam tests this less than Azure-to-Azure but it does appear.
A 3-tier app (web / app / SQL on IaaS VMs) needs RTO 60 minutes / RPO 15 minutes cross-region. Solution: ASR replication for all VMs from eastus2 to westus2. Recovery plan: step 1 — domain controllers and SQL VMs; step 2 — app tier (after health check via post-script); step 3 — web tier (after health check); step 4 — Azure Automation runbook to update Traffic Manager weights. Test failover runs quarterly in an isolated VNet for the DR drill, with no impact on production.
Databases have richer BC/DR primitives than VMs. AZ-305 tests Azure SQL auto-failover groups (regional failover for SQL Database / Managed Instance) and Cosmos DB multi-region writes. Picking the right managed feature beats hand-rolled replication every time.
- Azure SQL auto-failover groups: bundle one primary and one secondary Azure SQL DB / Managed Instance in different regions with a single listener DNS endpoint. App connects to the listener; failover (planned or unplanned) flips DNS — clients reconnect transparently.
- Failover-group failure modes: automatic (Microsoft initiates after grace period when primary unreachable) or manual. Read-Write listener follows primary; Read-Only listener can route to secondary for read replicas.
- Service tier and replication SLA: Business Critical tier offers in-region zone redundancy + cross-region failover groups with RPO < 5 seconds. General Purpose has lower SLA. Hyperscale uses page server replicas — different failover story.
- Cosmos DB multi-region writes: enable to make every region a writeable replica. Conflict resolution: last writer wins (default), custom merge procedure (you write JavaScript), or manual via conflicts feed. Active-active across regions without building your own replication.
- Cosmos DB consistency levels: strong, bounded staleness, session, consistent prefix, eventual. Strong forbids multi-region writes. Session is the SDK default — strongest consistency that still allows multi-region writes for a given partition.
- Storage redundancy revisited: for the DB's storage substrate, GZRS / RA-GZRS are the only options that survive an in-region zone outage AND cross-region. Pair with auto-failover groups for a complete picture.
A global SaaS needs < 5-minute RTO and active-active writes. Choice for the OLTP tier: Cosmos DB with multi-region writes enabled in three regions, session consistency, last-writer-wins conflict resolution. Choice for the reporting DB (already running on Azure SQL): Azure SQL Business Critical with auto-failover group primary in westeurope, secondary in northeurope, automatic failover, RPO < 5 seconds. App's connection strings use the failover-group listener for SQL and the global account endpoint for Cosmos.
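Last-writer-wins is simple enough to sketch directly. The function below keeps whichever conflicting version has the highest value at the resolution path. The `_ts` timestamp property is used as the default path here, and the sample documents and region names are made up for illustration.

```python
def last_writer_wins(versions: list[dict], path: str = "_ts") -> dict:
    """Return the conflicting version with the highest value at `path`."""
    return max(versions, key=lambda doc: doc[path])


# Two regions updated the same item at nearly the same time (illustrative data).
conflict = [
    {"id": "order-42", "status": "packed", "region": "westeurope", "_ts": 1_700_000_000},
    {"id": "order-42", "status": "shipped", "region": "eastus", "_ts": 1_700_000_003},
]

winner = last_writer_wins(conflict)
print(winner["region"], winner["status"])  # eastus shipped
```

The losing write is silently discarded under this policy, which is why workloads that cannot tolerate lost updates need a custom merge procedure or the conflicts feed.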
Azure networking design at scale almost always lands on hub-and-spoke or Virtual WAN. Pick the wrong one and your transit cost explodes or you can't route between regions. AZ-305 tests the topology decision and the configuration flags that make peering / gateway-transit actually work.
- Hub-and-spoke (traditional): one hub VNet per region with shared services (firewall, gateway, DNS), spokes peer to the hub. Build it yourself with VNet peering. Cost: per-peering and per-GB egress.
- Virtual WAN: Microsoft-managed hub-and-spoke at global scale. Single Virtual WAN resource holds multiple hubs (one per region) and orchestrates peering between them. Supports VPN, ExpressRoute, and Secure Hub (built-in Firewall + Routing Intent).
- Gateway transit flag pair: on the hub-side peering, enable `Allow gateway transit`; on the spoke-side, enable `Use remote gateways`. Now the spoke uses the hub's VPN/ExpressRoute Gateway. Without it, every spoke needs its own gateway.
- Peering non-transitivity: A↔B and B↔C does NOT give A↔C. To allow spoke-to-spoke traffic in hub-and-spoke, deploy an NVA or Azure Firewall in the hub and add UDRs in each spoke. Virtual WAN's Secure Hub does this natively via Routing Intent.
- Address space planning: assign a region-wide supernet (e.g., 10.10.0.0/16 for eastus2). Carve spokes from that. Reserve 10.255.0.0/16 (or similar) for the hub's GatewaySubnet / AzureFirewallSubnet / AzureBastionSubnet — those need specific names AND minimum sizes.
- When to pick which: < 10 spokes in one region → hand-rolled hub-and-spoke is fine. Multi-region or > 20 spokes → Virtual WAN. Need to integrate SD-WAN appliances → Virtual WAN. Tight cost optimisation in a small footprint → hand-rolled.
A multinational retailer with three regions and 60 spoke VNets total. Choice: Virtual WAN with one hub per region. Each hub is a Secure Hub running Azure Firewall with Routing Intent configured for private + internet inspection. Spokes peer to their regional hub; cross-region traffic flows through the Microsoft backbone between hubs at no per-peering cost. ExpressRoute circuits terminate at one hub per region; failover via Global Reach.
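The address-planning advice above is easy to check with the standard library. This sketch carves spokes from the example regional supernet and subnets from the reserved hub range. The /24 spoke size and /26 hub subnet size are assumptions for illustration, so confirm current minimum sizes before using them.

```python
import ipaddress

region_supernet = ipaddress.ip_network("10.10.0.0/16")  # eastus2 supernet from the text
hub_range = ipaddress.ip_network("10.255.0.0/16")       # reserved hub range from the text

# Carve /24 spokes from the regional supernet; the /24 size is an assumption.
spokes = list(region_supernet.subnets(new_prefix=24))[:3]

# Carve hub subnets; /26 is illustrative, check the documented minimum for each.
hub_subnets = hub_range.subnets(new_prefix=26)
gateway_subnet = next(hub_subnets)    # would be named GatewaySubnet
firewall_subnet = next(hub_subnets)   # would be named AzureFirewallSubnet
bastion_subnet = next(hub_subnets)    # would be named AzureBastionSubnet

for spoke in spokes:
    # Peered VNets must not overlap, so verify each spoke against the hub range.
    assert not spoke.overlaps(hub_range)
    print("spoke", spoke)

print("GatewaySubnet", gateway_subnet)
print("AzureFirewallSubnet", firewall_subnet)
print("AzureBastionSubnet", bastion_subnet)
```

Generating ranges from one supernet per region keeps route summarisation possible later, since a whole region can be advertised as a single prefix.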
Connecting on-prem to Azure has three primary options: VPN over the public internet, ExpressRoute private circuits, and SD-WAN integration. AZ-305 tests when to pick each based on bandwidth, SLA, latency, and cost.
- Site-to-Site VPN: IPsec tunnel over the internet. SKUs `VpnGw1` through `VpnGw5` scale bandwidth and concurrent tunnels. Deploy the gateway active-active across AZs for the 99.95% gateway SLA. Cheap (~$140/mo for VpnGw1) but no bandwidth guarantee.
- ExpressRoute: private circuit through a connectivity provider that bypasses the public internet and comes with an SLA (99.9 or 99.95% with redundant circuits). Two peering types: private (Azure VNets), Microsoft (Microsoft 365 / public Azure PaaS over the private circuit).
- ExpressRoute Global Reach: connects two ExpressRoute circuits to each other through the Microsoft backbone. Useful for branch-to-branch traffic without backhauling through HQ.
- ExpressRoute FastPath: bypasses the ExpressRoute Gateway data path for direct circuit-to-VM traffic. Higher throughput, requires Ultra Performance SKU. Needed for >10 Gbps real-world throughput.
- SD-WAN integration: Virtual WAN supports SD-WAN partner appliances (Cisco SD-WAN, Aruba, Versa, etc.) as VNet-deployed CPE. The branch office connects to its SD-WAN appliance, which terminates into the regional Virtual WAN hub. Great for retail / 100+ branches.
- Cost vs criticality: VPN for dev/test and small businesses; ExpressRoute for production with bandwidth or SLA requirements; ExpressRoute + VPN failover when ExpressRoute downtime would cost more than the VPN keep-alive.
A bank needs to connect HQ (200 Mbps, sustained) and 40 branch offices to Azure with SLA-backed connectivity. Design: ExpressRoute circuit at HQ (500 Mbps, dual circuits in different ER providers for 99.95% SLA), terminated at the Virtual WAN hub. Branches connect via SD-WAN appliances (Cisco vEdge in Virtual WAN partner pattern) that route into the same hub. Backup VPN tunnels from HQ to Azure as failover for the unlikely dual-ER outage.
The last network design lesson covers the three pieces every modern Azure architecture needs: private endpoints (no public IPs), Azure DNS Private Resolver (hybrid name resolution), and global load balancing (Front Door vs Traffic Manager).
- Private endpoints: a NIC in your VNet with a private IP that resolves to a PaaS service (Storage, SQL, Key Vault, App Service, etc.). Disable the public endpoint entirely once private endpoints are in place. The modern AZ-305 default for any PaaS-to-VNet integration.
- Azure Private DNS zones: required for name resolution to the private endpoint's IP. Each PaaS service has a documented Private DNS zone name (`privatelink.blob.core.windows.net`, `privatelink.database.windows.net`). Link the zone to every VNet that needs to resolve.
- Azure DNS Private Resolver: the managed bridge for hybrid DNS. Inbound endpoints answer queries from on-prem; outbound endpoints forward configured zones to on-prem resolvers. Replaces the old DNS-forwarder-VM pattern.
- Service endpoints vs private endpoints: service endpoints route traffic to PaaS over the Microsoft backbone but keep the public IP. Private endpoints make the PaaS endpoint private. New designs almost always pick private endpoints.
- Traffic Manager (DNS): global load balancer based on DNS resolution. Methods: Performance (lowest latency), Geographic (per source country), Weighted, Priority (active/passive). Failover bounded by client DNS TTL — typically 30s+. Use when sub-second failover isn't required.
- Azure Front Door (anycast): global L7 reverse proxy at Microsoft edge POPs. WAF, CDN caching, sub-second failover, URL-based routing. Premium tier adds private origin support (your origins don't need public IPs). The modern default for global L7.
- Decision pattern: Regional internal/external → Load Balancer / Application Gateway. Global L7 + WAF + edge caching → Front Door. Global DNS-based without WAF/caching → Traffic Manager. Static assets only → Front Door / CDN.
A global SaaS needs (1) WAF and edge caching at <100ms RTT globally, (2) all PaaS services accessed privately from the app VNet, (3) on-prem clients can resolve Azure Private DNS records. Design: Azure Front Door Premium in front, two regional origin groups, WAF in Prevention mode, caching /static/*. App VNet uses private endpoints for Storage, SQL, Key Vault — all public endpoints disabled. Azure DNS Private Resolver with inbound endpoint exposes the Private DNS zones to on-prem resolvers via DNS conditional forwarders.
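To make the Private DNS zone naming concrete, the sketch below maps a public PaaS hostname to the record name that should exist in the matching privatelink zone. The zone table contains only the two zones named in this lesson; the `privatelink_name` helper and the `contoso` account name are hypothetical, and a real design would take the full zone list from Microsoft's documentation.

```python
# Public DNS suffix -> Private DNS zone (only the two zones named in the lesson).
PRIVATELINK_ZONES = {
    "blob.core.windows.net": "privatelink.blob.core.windows.net",
    "database.windows.net": "privatelink.database.windows.net",
}


def privatelink_name(public_fqdn: str) -> str:
    """Return the FQDN the private endpoint record should have in its zone."""
    name = public_fqdn.lower().rstrip(".")
    for suffix, zone in PRIVATELINK_ZONES.items():
        if name.endswith("." + suffix):
            label = name[: -len(suffix) - 1]  # resource name, e.g. "contoso"
            return f"{label}.{zone}"
    raise ValueError(f"no privatelink zone known for {public_fqdn!r}")


if __name__ == "__main__":
    # "contoso" is a made-up storage account name.
    print(privatelink_name("contoso.blob.core.windows.net"))
```

Clients keep using the public hostname; resolution inside a linked VNet, or from on-prem through the resolver's inbound endpoint, is what returns the private IP.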
Azure compute is a ladder of abstractions from raw VMs up to serverless functions. AZ-305 asks you to climb the ladder — pick the highest-abstraction option that satisfies the requirement, because each step up cuts operational toil.
- Azure VMs: full control + responsibility for OS patching. Use only when you need OS-level access, specific kernels, GPUs not in PaaS, or licensing tied to physical CPU sockets.
- App Service: managed web hosting. Standard / Premium / Isolated tiers. Slot swap, autoscale, custom domains, managed identities. Right for stateless web apps, APIs, background jobs (WebJobs).
- Azure Container Instances (ACI): single-container serverless runs. Pay per second. Right for batch jobs, ephemeral tasks, virtual-kubelet bursts from AKS.
- Container Apps: managed Kubernetes-shaped serverless containers. KEDA-based scaling (including scale-to-zero), Dapr sidecars, revisions for blue/green. The modern "I have a container but don't want full AKS" answer.
- AKS: managed Kubernetes. The control plane is free on the Free tier (the Standard tier adds a paid uptime SLA); you otherwise pay only for worker nodes. Right when you need K8s primitives, complex orchestration, the Helm ecosystem, or third-party operators.
- Functions: event-driven serverless. Consumption (true serverless), Premium (warm), App Service plan (predictable cost). Right for trigger-driven workflows, glue code, event handlers.
- Azure Batch: managed batch + HPC. Scale-to-zero, Spot VM support, low-priority pools. Right for one-off massive parallel jobs (rendering, scientific compute).
A SaaS team has: (1) a long-running web tier (state in DB only), (2) a containerised image-processing service that scales 0 → 200 on bursts, (3) a CSV-to-DB ETL that runs nightly, (4) a research team's nightly molecular-dynamics simulation across hundreds of cores. Choices: (1) App Service Premium V3 with autoscale. (2) Container Apps with KEDA queue-triggered scale 0-200. (3) Function App on Consumption plan, timer trigger. (4) Azure Batch low-priority Spot pool, which scales to zero between runs.
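The "climb the ladder" rule can be sketched as an ordered series of checks that fall through to the most managed option that still fits. The `Workload` fields and the ordering are illustrative assumptions, and ACI is left out for brevity; treat this as a way to rehearse the reasoning, not as an official selection algorithm.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    """Hypothetical requirement flags for one workload."""
    needs_os_access: bool = False   # specific kernel, OS-level agents, etc.
    massive_parallel: bool = False  # HPC or rendering style batch job
    containerised: bool = False
    needs_k8s_api: bool = False     # Helm charts, operators, K8s primitives
    event_driven: bool = False      # runs only in response to a trigger


def pick_compute(w: Workload) -> str:
    """Check hard constraints first, then prefer the more managed service."""
    if w.needs_os_access:
        return "Azure VMs"
    if w.massive_parallel:
        return "Azure Batch"
    if w.containerised:
        return "AKS" if w.needs_k8s_api else "Container Apps"
    if w.event_driven:
        return "Functions"
    return "App Service"
```

Applied to the scenario above, the four workloads come out as App Service, Container Apps, Functions, and Azure Batch respectively.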
Azure API Management (APIM) sits in front of your APIs and adds rate limiting, transformation, auth, and a developer portal. AZ-305 tests when APIM is the right choice and which tier / topology fits.
- Tier ladder: Consumption (pay-per-call serverless), Developer (non-prod single instance), Basic / Standard (production, single region), Premium (multi-region, multi-AZ, VNet integration), Isolated (dedicated single-tenant, regulated industries).
- Premium features: multi-region deployment, VNet integration with internal mode (no public IP), Availability Zone deployment, and production use of the self-hosted gateway (run the APIM gateway on-prem or in another cloud; the Developer tier also includes it for testing).
- Policies: XML pipeline (inbound, backend, outbound, on-error). Rate limit, set-header, set-backend-service, validate-jwt, cache-lookup, mock-response, return-response. The composability is what makes APIM more than just a reverse proxy.
- Products and subscriptions: a product is a curated bundle of APIs with a subscription key. Consumers subscribe to products via the developer portal. Per-product rate limits and approval workflows.
- Backends: point to App Service, Function App, AKS, on-prem (via VNet), or third-party HTTP endpoints. APIM caches credentials and provides one consistent client-facing surface for many backends.
- Self-hosted gateway: the APIM gateway as a container you run anywhere — on-prem, AWS, edge. Control plane in Azure; data plane wherever needed. Solves "API is on-prem but I want APIM features".
A bank exposes 40 internal APIs to external partners with: (1) per-partner rate limits, (2) JWT validation against partner IdPs, (3) audit logging, (4) regulatory requirement that one of the APIs runs in-DC behind a firewall. Choice: APIM Premium with multi-region deployment for SLA, VNet integration in internal mode. Each partner gets a Product wrapping selected APIs, with per-product rate-limit policies. Self-hosted gateway runs on-prem for the regulated API; same APIM management plane handles policy.
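To show what a per-partner rate limit means in practice, here is a minimal fixed-window counter keyed by partner. It is an illustration of the concept only, not how APIM implements its rate-limit policy; the class name and the `calls` / `renewal_period` parameter names are chosen for this sketch, and the clock is passed in explicitly so the behaviour is easy to test.

```python
class FixedWindowRateLimiter:
    """Allow at most `calls` requests per key in each `renewal_period` seconds."""

    def __init__(self, calls: int, renewal_period: float) -> None:
        self.calls = calls
        self.renewal_period = renewal_period
        # key -> (window start time, requests counted in that window)
        self._windows: dict[str, tuple[float, int]] = {}

    def allow(self, key: str, now: float) -> bool:
        start, count = self._windows.get(key, (now, 0))
        if now - start >= self.renewal_period:
            start, count = now, 0  # window expired: start a fresh one
        if count >= self.calls:
            self._windows[key] = (start, count)
            return False  # a gateway would answer HTTP 429 here
        self._windows[key] = (start, count + 1)
        return True


if __name__ == "__main__":
    limiter = FixedWindowRateLimiter(calls=2, renewal_period=60)
    for t in (0, 1, 2, 61):
        print(t, limiter.allow("partner-a", now=t))
```

Because each key has its own window, one partner exhausting its quota does not affect the others, which is the property the per-product limits in the scenario rely on.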
AZ-305 tests migration design more than tooling specifics. Know what each tool does, which one is the right fit, and the order to combine them in a phased migration.
- Azure Migrate: the umbrella tool with Discovery & Assessment, Server Migration (powered by ASR), and Database Migration (powered by DMS). Always start here — the discovery appliance maps on-prem dependencies and right-sizes Azure VMs.
- Database Migration Service (DMS): two modes — offline (cutover at the end, downtime equal to migration), online (continuous replication, near-zero downtime cutover via DNS flip). Online supports SQL Server → Azure SQL Database / Managed Instance, MySQL, PostgreSQL.
- Server Migration: uses ASR-style replication. Same continuous-replication model, planned-failover semantics for the cutover. Right for VM-level migrations from VMware, Hyper-V, AWS, GCP.
- Service Bus: enterprise message bus — queues, topics/subscriptions, sessions for ordered processing, dead-letter queues, scheduled messages, duplicate detection. Pick over Event Grid when you need queue-style competing-consumer semantics and ordering.
- Event Grid: publish/subscribe for state-change events at massive scale. System topics from Azure services (blob created, VM started), custom topics from your apps, partner topics from SaaS. Push delivery, retries, dead-letter.
- Decision pattern: Service Bus when you need ordering / sessions / queues / competing consumers / FIFO. Event Grid when you need pub/sub with massive fan-out. Both can coexist in a single architecture (Service Bus for transactional flows; Event Grid for state-change broadcasts).
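The difference between queue-style competing consumers and pub/sub fan-out is easiest to see side by side. The toy model below uses plain Python lists rather than either Azure service; round-robin delivery is a simplifying assumption, since a real queue hands each message to whichever consumer asks next.

```python
from itertools import cycle


def competing_consumers(messages, consumers):
    """Queue semantics: every message is handled by exactly one consumer."""
    delivered = {c: [] for c in consumers}
    # Round-robin stands in for "whichever consumer is free next".
    for message, consumer in zip(messages, cycle(consumers)):
        delivered[consumer].append(message)
    return delivered


def fan_out(events, subscribers):
    """Pub/sub semantics: every subscriber receives its own copy of every event."""
    return {s: list(events) for s in subscribers}


if __name__ == "__main__":
    print(competing_consumers(["m1", "m2", "m3"], ["worker-a", "worker-b"]))
    print(fan_out(["blob-created"], ["audit", "thumbnailer"]))
```

In the first case adding consumers spreads the work; in the second, adding subscribers multiplies the deliveries.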
A bank migrates a 100-VM on-prem datacentre with an Oracle DB to Azure with near-zero downtime. The plan has three phases. Phase 1: the Azure Migrate Discovery appliance maps dependencies, right-sizes VMs, and identifies the Oracle DB as the integration bottleneck. Phase 2: Server Migration replicates VMs continuously, and the app team validates in a test failover. Phase 3: DMS online mode replicates Oracle → Azure Database for PostgreSQL, with schema conversion via ora2pg (SSMA for Oracle targets SQL Server and Azure SQL, not PostgreSQL). Cutover: short read-only window, DMS final sync, DNS flip to the Azure endpoints. Total cutover downtime: < 10 minutes.
Hard-coded secrets are the easiest credential-theft path. Managed identities eliminate them — Azure issues short-lived tokens to your resource automatically. AZ-305 expects you to design every service-to-service auth around managed identities by default.
- System-assigned MI: tied to the resource's lifecycle — created/deleted with the resource. Use when one identity per resource is the right scope. The default for most cases.
- User-assigned MI: standalone resource you create, assign to multiple resources. Use when many resources share one role (e.g., a fleet of VMs all accessing the same Key Vault) or when you need to grant access BEFORE the resource exists.
- Federated identity credentials: let an external workload (GitHub Actions, AWS, Kubernetes via OIDC) trade its identity for an Azure token without secrets. Right for CI/CD pipelines connecting to Azure.
- RBAC + Key Vault references: grant the MI a data-plane role on the target (Storage Blob Data Reader, Key Vault Secrets User). For Azure SQL, map the MI to a contained database user with database roles such as db_datareader; SQL DB Contributor is a management-plane role and grants no data access. In App Service / Functions, use @Microsoft.KeyVault(SecretUri=...) in app settings; the runtime fetches the secret with the MI, with no code changes.
- Service Connector: newer abstraction that wires identity + connection for App Service / Container Apps / Functions → DB / Storage / Cache. Reduces the boilerplate of provisioning + permission + config string.
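- Audit trail: managed identity token requests are recorded in the Entra ID sign-in logs (under managed identity sign-ins), and the target resource's diagnostic logs record what the identity accessed once diagnostic settings are enabled. Hard-coded secrets, by contrast, cannot be attributed to a specific caller when they are reused.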
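The Key Vault reference syntax mentioned above is just a structured string in an app setting. The sketch below parses the `SecretUri` form to show which parts it carries; it is a local illustration with hypothetical names (`parse_key_vault_reference`, `myvault`, `db-password`), not the resolver that App Service runs, and it does not handle any other reference form.

```python
import re
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse

_REFERENCE = re.compile(r"^@Microsoft\.KeyVault\(SecretUri=(?P<uri>[^)]+)\)$")


@dataclass(frozen=True)
class SecretReference:
    vault_host: str
    secret_name: str
    version: Optional[str]  # None means "latest version"


def parse_key_vault_reference(setting_value: str) -> Optional[SecretReference]:
    """Return the parsed reference, or None if the value is a plain setting."""
    match = _REFERENCE.match(setting_value.strip())
    if match is None:
        return None
    uri = urlparse(match.group("uri"))
    parts = [p for p in uri.path.split("/") if p]
    if uri.scheme != "https" or len(parts) < 2 or parts[0] != "secrets":
        raise ValueError(f"not a secret URI: {match.group('uri')!r}")
    version = parts[2] if len(parts) > 2 else None
    return SecretReference(uri.netloc, parts[1], version)


if __name__ == "__main__":
    value = "@Microsoft.KeyVault(SecretUri=https://myvault.vault.azure.net/secrets/db-password/)"
    print(parse_key_vault_reference(value))
```

Note that the setting holds only a pointer to the secret; the secret value itself never appears in configuration, which is the point of combining references with a managed identity.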
A web app on App Service needs to read secrets from Key Vault, write blobs to Storage, and connect to Azure SQL DB. Design: enable system-assigned managed identity on the App Service. Grant: Key Vault Secrets User on the vault, Storage Blob Data Contributor on the storage account, db_datareader + db_datawriter mapped to the MI on the SQL DB. App settings reference Key Vault secrets via @Microsoft.KeyVault URIs. No connection strings with passwords anywhere. Pipeline uses federated identity credentials from GitHub Actions OIDC for deploys.
AZ-305 covers IaC at the architectural level — which tool fits which org, how to gate deploys, how to handle drift. Picking the right tool is as much culture as technology.
- ARM JSON: the legacy template format. Verbose, ugly, but the substrate everything else compiles to. Read-only in practice — you'll rarely hand-write ARM in 2026.
- Bicep: Microsoft's DSL over ARM. Cleaner syntax, modules, conditional resources, loops. Transpiles to ARM. Free, first-class Azure support. The native AZ-305 IaC answer.
- Terraform: HashiCorp's multi-cloud IaC. Same use cases as Bicep for Azure-only, plus consistent tooling if your org runs multi-cloud. Use when the org already has Terraform skills.
- Deployment Stacks: a managed deployment with deny assignments — Azure prevents resources owned by the stack from being modified or deleted outside the stack. Replaces Azure Blueprints. Critical for governance scaffolding.
- Template Specs: versioned Bicep/ARM templates stored as Azure resources. Share approved templates centrally; consumers deploy by reference. Used in landing-zone patterns to publish a "compliant network" template that BU teams reuse.
- Drift detection: Bicep what-if / Terraform plan shows the diff between declared state and actual. Azure Policy (with Audit effect) detects post-deploy configuration changes. Combine the two: IaC for greenfield + Policy for ongoing compliance enforcement.
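Conceptually, drift detection is a comparison between the declared state and what actually exists. The sketch below reduces that to two flat dictionaries; real tools compare full resource graphs, so the `detect_drift` helper, the `ABSENT` marker, and the property names are all illustrative assumptions.

```python
ABSENT = "<absent>"  # marker for a property missing on one side


def detect_drift(declared: dict, actual: dict) -> dict:
    """Map each differing property to a (declared, actual) pair."""
    drift = {}
    for key in sorted(declared.keys() | actual.keys()):
        want = declared.get(key, ABSENT)
        have = actual.get(key, ABSENT)
        if want != have:
            drift[key] = (want, have)
    return drift


if __name__ == "__main__":
    # Hypothetical storage account properties.
    declared = {"sku": "Standard_ZRS", "publicNetworkAccess": "Disabled"}
    actual = {"sku": "Standard_ZRS", "publicNetworkAccess": "Enabled", "tags.Owner": "ops"}
    for prop, (want, have) in detect_drift(declared, actual).items():
        print(f"{prop}: declared={want} actual={have}")
```

An empty result means the environment matches the template; anything else is a candidate either for redeployment or for updating the template to reflect an approved change.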
An enterprise builds a landing zone — repeatable subscription scaffolding with mandatory policies, hub VNet, monitoring workspace. Old way: Azure Blueprints. Modern way: Bicep modules published as Template Specs, deployed via Deployment Stacks with denySettings: denyDelete so BU teams can't accidentally delete the shared resources. Azure Policy initiative attached at the MG above the subscription enforces ongoing compliance (no public IPs without exemption, etc.). Drift detection via Policy audit + monthly Bicep diff reports.
The final design domain on AZ-305 is FinOps — designing for cost is as architecturally important as designing for availability. The exam tests cost-management primitives, commitment-based discounts, and the trade-offs of cost-optimised architectures.
- Reservations: 1- or 3-year commitments on specific SKUs (VMs, SQL DB, Cosmos DB, App Service, Synapse). 30-65% discount. Use for steady-state predictable workloads. Reservations can be exchanged for a different SKU, subject to Microsoft's exchange policy (Azure has no separate "convertible" reservation type).
- Savings Plans (compute): 1- or 3-year commitment to a $/hour spend across compute (VMs, App Service, Container Instances, Functions Premium). More flexible than VM Reservations — covers SKU changes automatically.
- Spot VMs: up to 90% discount, can be evicted with 30-second notice. Right for fault-tolerant batch / dev-test / stateless workloads. Combine with VMSS for graceful eviction handling.
- Azure Hybrid Benefit: use existing Windows Server / SQL Server licenses on Azure to skip the per-OS cost. Major saving on lift-and-shift migrations.
- Cost analysis + Budgets: Cost Management gives you slice-by-tag analytics and per-subscription budgets with threshold alerts. AZ-305 expects you to design "tag-everything" from day 1 so chargeback works.
- FinOps tooling integration: export Cost Management data to Storage → ingest into Power BI or third-party (CloudHealth, Apptio). Combine actual cost with capacity / utilisation metrics to find under-used resources for right-sizing.
- Architecture trade-offs: serverless (Consumption Functions, Container Apps scale-to-zero) trades latency for cost at low utilisation. PaaS (App Service) is a middle ground. IaaS (VMs) is cheapest at high steady utilisation if you've also bought reservations.
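The trade-off between commitment discounts and Spot capacity comes down to simple arithmetic. The sketch below is a worked example under stated assumptions: the hourly rate, VM counts, peak hours, and the 50% commitment discount are made-up inputs, and 730 hours per month is a common billing approximation. Only the "up to 90%" Spot figure comes from the list above.

```python
HOURS_PER_MONTH = 730  # common billing approximation


def monthly_cost(instances, hourly_rate, hours=HOURS_PER_MONTH, discount=0.0):
    """Cost of `instances` VMs running `hours` at a discounted hourly rate."""
    return instances * hourly_rate * hours * (1.0 - discount)


def blended_monthly_cost(baseline_vms, peak_vms, peak_hours, hourly_rate,
                         commitment_discount, spot_discount):
    """Always-on baseline under a commitment plus Spot capacity for peak hours."""
    baseline = monthly_cost(baseline_vms, hourly_rate, discount=commitment_discount)
    peak = monthly_cost(peak_vms, hourly_rate, hours=peak_hours, discount=spot_discount)
    return baseline + peak


if __name__ == "__main__":
    # Hypothetical inputs: 10 baseline VMs, 30 extra VMs for 100 peak hours, $0.20/hour.
    optimised = blended_monthly_cost(10, 30, 100, 0.20, 0.5, 0.9)
    pay_as_you_go = blended_monthly_cost(10, 30, 100, 0.20, 0.0, 0.0)
    print(f"optimised: ${optimised:,.2f}  pay-as-you-go: ${pay_as_you_go:,.2f}")
```

With these inputs the always-on baseline dominates the bill, which is why the commitment discount on steady-state capacity usually matters more than the Spot discount on the peak.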
A 24/7 SaaS with predictable baseline + 4× peak. Cost design: baseline VMs covered by a 3-year Savings Plan for the steady tier (~50% discount). Peak handled by Spot VMSS instances (up to 90% discount, accept eviction during scale-down). Windows licenses come from on-prem via Azure Hybrid Benefit. Cost Management budgets at the subscription level with 80%/100% alerts; tagging policy enforces CostCenter + Environment + Owner on every resource via Modify policy effect. Monthly Power BI report cross-references cost with VM Insights utilisation to flag right-sizing candidates.
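The tagging rule in this design can be rehearsed locally before it is written as a policy. The sketch below checks for the three required tags and fills gaps from a set of defaults, loosely mirroring what a Modify effect does when it adds a missing tag. It is plain Python over dictionaries with hypothetical helper names; the real enforcement belongs in Azure Policy.

```python
REQUIRED_TAGS = ("CostCenter", "Environment", "Owner")


def missing_tags(resource_tags, required=REQUIRED_TAGS):
    """Return required tag names that are absent or empty on the resource."""
    return [tag for tag in required if not resource_tags.get(tag)]


def apply_defaults(resource_tags, defaults, required=REQUIRED_TAGS):
    """Add missing required tags from `defaults` without overwriting existing ones."""
    merged = dict(resource_tags)
    for tag in missing_tags(resource_tags, required):
        if tag in defaults:
            merged[tag] = defaults[tag]
    return merged


if __name__ == "__main__":
    # Hypothetical resource tags and resource-group defaults.
    vm_tags = {"Environment": "prod"}
    rg_defaults = {"CostCenter": "CC-1042", "Owner": "platform-team"}
    print(missing_tags(vm_tags))
    print(apply_defaults(vm_tags, rg_defaults))
```

Running the check over an exported resource inventory gives a quick estimate of how much remediation a new tagging policy would trigger.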
Key AZ-305 concepts to master
Management Groups vs Azure Policy scope
Azure Policy assigned at the Management Group root cascades to ALL subscriptions underneath — including future subscriptions. Many candidates confuse Azure Policy (enforcement/compliance) with Azure Blueprints (deployment scaffolding) and Azure RBAC (access control). These are three distinct tools. A single Azure Policy at the root MG replaces the need to configure identical policies in each of 80+ subscriptions manually.
Availability Sets vs Availability Zones
Availability Sets protect against rack/hardware failure within a single datacenter, which is useful in regions or for SKUs where zones aren't available. Availability Zones are physically separate datacenters with independent power and networking, so they protect against full datacenter failure. The AZ-305 exam frequently tests this distinction. For new greenfield deployments, Availability Zones (with a zone-redundant Load Balancer + VMSS) are the preferred answer over Availability Sets.
Cosmos DB consistency levels & multi-master write
For active-active global deployments, Cosmos DB with multi-region write enabled allows writes to any region. The consistency level selection matters: Strong guarantees linearizability but incurs cross-region latency and is not available on accounts with multiple write regions, so Bounded Staleness or Session is preferred for global apps. SQL Database Active Geo-Replication creates read-only secondaries; you cannot write to secondary regions, which makes it unsuitable for true active-active patterns.
6-week study plan
Top 4 reasons candidates fail AZ-305
- Confusing governance tools: Azure Policy (enforce standards), Blueprints (subscription scaffolding), RBAC (access), and Locks (deletion prevention) all serve different purposes. The exam writes scenarios where the wrong tool looks plausible but doesn't meet the stated requirement.
- Mixing up HA tiers: Availability Sets ≠ Availability Zones. Geo-Replication ≠ Auto-Failover Groups. Active-active ≠ active-passive. Memorize the scope each one protects (rack, datacenter, or region) and its RTO/RPO characteristics.
- Cost optimization blind spots: Overlooking Blob lifecycle policies, Log Analytics Basic plan, SQL Serverless auto-pause, and Azure Batch scale-to-zero. AZ-305 has a dedicated cost optimization thread woven through every domain.
- Weak on Managed Identity: Many candidates default to connection strings and SAS tokens when Managed Identity is the zero-credential, least-privilege answer. The exam explicitly rewards this pattern for VM→Key Vault, ADF→ADLS Gen2, and AML→Storage scenarios.
AZ-305 vs AZ-104: What’s different?
AZ-104 (Azure Administrator) tests how to configure Azure services — deploying VMs, setting RBAC assignments, configuring storage. AZ-305 (Azure Solutions Architect Expert) tests how to design — which service combination best meets business requirements, cost constraints, and SLA targets.
The AZ-104 certification is the prerequisite for the Azure Solutions Architect Expert certification that AZ-305 leads to. The architect exam assumes you can implement; it focuses on justifying architectural decisions under constraints: budget, compliance, RTO/RPO targets, team skill sets, and existing investments. Case study questions test multi-service design holistically.