Python for SEO Audits 2026: Automation & APIs
Master Python for SEO audits in 2026. Compare Google APIs, RFC 9309, async workflows, rate limits. Code patterns, pitfalls, and automation checklists.
Python is the most versatile language for automating technical SEO audits. This guide covers the 2026 stack—Google Search Console API, Indexing API, PageSpeed Insights (PSI), Chrome User Experience Report (CrUX), robots.txt parsing (RFC 9309), sitemap extraction, and structured data validation—with Python code patterns, authentication setups, rate limit management, and common pitfalls. You’ll learn when to use async batch processing, how to avoid hidden GSC data gaps, and how to stay current with Google’s evolving API landscape.
The 2026 Python SEO Audit Stack
The ecosystem has shifted in several key ways since 2024:
- Google Custom Search JSON API is deprecated for new customers as of January 1, 2027. Existing projects continue but receive no new features. For site‑specific search, Google recommends Agent Search / Discovery Engine (Vertex AI) using gRPC Python clients (Source: Google Developers).
- Async Python patterns (FastAPI, httpx, aiohttp) are now standard for batch operations—handling 2,000+ URL inspections or 50 concurrent PSI requests without blocking.
- AI coding assistants (Claude, Copilot, Cursor) now generate API integration scripts from plain‑English prompts, letting practitioners focus on audit logic rather than boilerplate.
- Pay‑per‑query API pricing (e.g., DataForSEO, Serper) competes with free Google quotas, making budget allocation a new skill for in‑house SEO engineers.
Recommended Python Libraries (2026)
| Use Case | Library |
|---|---|
| HTTP / Async | requests (sync), httpx (async + HTTP/2), aiohttp |
| Google APIs | google-api-python-client |
| Authentication | google-auth, google.oauth2.service_account |
| Data analysis | pandas, numpy |
| Parsing | beautifulsoup4, lxml, urllib.robotparser (stdlib), advertools (sitemaps) |
| NLP / scoring | spaCy, google-cloud-language (5,000 units/month free) |
| Retry / backoff | tenacity |
Why Official Documentation Matters in 2026
- Google APIs are living documents; deprecations are announced via changelogs (e.g., Custom Search API).
- IETF RFC 9309 is the only authoritative standard for robots.txt—community interpretations are often wrong.
- Schema.org and Rich Results Test docs define allowed syntax; third‑party validators may miss Google‑specific nuances.
- OAuth 2.0 scopes and rate limit numbers are mandatory reading to avoid 403s and quota blocks.
Google Search Console (GSC) API
Official Docs & Quickstart
Google provides a complete OAuth 2.0 sample app at developers.google.com/webmaster-tools/search-console-api-original/v3/quickstart. The API version is v1 (service name 'searchconsole'). Install the client with pip install --upgrade google-api-python-client.
Authentication & Scopes
- OAuth 2.0 for Web Server Applications (recommended for user‑triggered scripts): scopes webmasters.readonly or webmasters. Flow: Google API Console → activate GSC → request scope → consent → token → refresh.
- Service Accounts (for automated, unattended audits): create service account in Cloud Console, download JSON key. Use google.oauth2.service_account.Credentials.from_service_account_file() in Python. Domain‑wide delegation is required: set up in Google Workspace Admin console and call .with_subject('[email protected]'). Critical: the service account must be added as a verified owner of the property in Search Console.
Pitfall: Many tutorials only show OAuth web flows, leaving team‑run scheduled scripts broken because they lack a user sitting at a browser. Use service accounts for cron jobs.
Key Endpoints & Code Snippets
Sites.list – returns list of verified properties.
Sitemaps.list – returns per‑sitemap metadata (path, lastSubmitted, errors, contents).
Search Analytics.query – the core audit endpoint.
request = {
    "startDate": "2026-01-01",
    "endDate": "2026-02-01",
    "dimensions": ["QUERY", "PAGE"],
    "rowLimit": 25000,
    "startRow": 0
}
response = service.searchanalytics().query(
    siteUrl='sc-domain:example.com',
    body=request
).execute()
Response keys: rows (list of dicts with clicks, impressions, ctr, position, keys). Pagination via startRow; you can extract up to 50,000 rows per property per day, far beyond the 1,000 rows the GSC UI shows.
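The startRow pagination described above can be sketched as a small loop. This is a minimal sketch, not Google's sample code: fetch_all_rows is a hypothetical helper name, and it assumes service is an already-authenticated client built with google-api-python-client.

```python
def fetch_all_rows(service, site_url, start_date, end_date,
                   dimensions=("QUERY", "PAGE"), page_size=25000):
    """Page through Search Analytics results by advancing startRow."""
    rows, start_row = [], 0
    while True:
        body = {
            "startDate": start_date,
            "endDate": end_date,
            "dimensions": list(dimensions),
            "rowLimit": page_size,
            "startRow": start_row,
        }
        response = service.searchanalytics().query(
            siteUrl=site_url, body=body
        ).execute()
        batch = response.get("rows", [])
        rows.extend(batch)
        # A short or empty page means there is nothing left to fetch.
        if len(batch) < page_size:
            return rows
        start_row += page_size
```

Each loop iteration is one API call, so a property with 60,000 matching rows costs three requests against the Search Analytics quota.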
URL Inspection.index.inspect – available since January 2022.
response = service.urlInspection().index().inspect(body={
    "siteUrl": "sc-domain:example.com",
    "inspectionUrl": "https://www.example.com/page"
}).execute()
Returns indexStatusResult (coverageState, googleCanonical, indexingState, lastCrawlTime, robotsTxtState, sitemap, verdict), mobileUsabilityResult, richResultsResult, and inspectionResultsLink.
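For batch audits it helps to flatten each response into one row. The sketch below assumes the results are wrapped under a top-level inspectionResult key, as in Google's API reference; summarize_inspection and the output column names are hypothetical choices for illustration.

```python
def summarize_inspection(response):
    """Flatten one URL Inspection response into a single-row dict."""
    result = response.get("inspectionResult", {})
    index = result.get("indexStatusResult", {})
    rich = result.get("richResultsResult", {})  # absent when nothing was detected
    return {
        "verdict": index.get("verdict"),
        "coverage_state": index.get("coverageState"),
        "google_canonical": index.get("googleCanonical"),
        "last_crawl_time": index.get("lastCrawlTime"),
        "robots_txt_state": index.get("robotsTxtState"),
        "rich_results_verdict": rich.get("verdict"),
        "inspection_link": result.get("inspectionResultLink"),
    }
```

A list of these dicts can be passed straight to pandas.DataFrame for filtering and export.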
Rate Limits (Source: Google Usage Limits, updated 2025‑08‑28)
| Endpoint | Short‑term | Long‑term |
|---|---|---|
| Search Analytics | 1,200 QPM per site/user | 30,000,000 QPD per project |
| URL Inspection | 600 QPM per site | 2,000 QPD per site |
| Other endpoints | 20 QPS per user | 100,000,000 QPD per project |
Hidden data: The GSC UI truncates query data at 1,000 rows. The API exposes the full long‑tail. One study found $40k/year in lost visibility from hidden long‑tail queries (Source: MarketMuse talk). Always paginate via the API.
Error handling: “quota exceeded” → wait 15 minutes or reduce dimension complexity. Grouping by both QUERY and PAGE is the most expensive dimension combination.
Indexing / Search APIs
Indexing API v3
Used to notify Google of new or updated pages. Google supports it only for pages with JobPosting structured data or BroadcastEvent markup embedded in a VideoObject (livestreams). Official Python library: google-api-python-client, service indexing, version v3. Service account authentication is strongly recommended. No official batch sample; implement your own batch loop with exponential backoff.
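One way to write such a batch loop is sketched below. It assumes service was built for the Indexing API with google-api-python-client; notify_urls, its parameters, and the injectable sleep argument are hypothetical names used for illustration.

```python
import time

def notify_urls(service, urls, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Publish URL_UPDATED notifications one by one, retrying with backoff."""
    published, failed = [], {}
    for url in urls:
        for attempt in range(max_attempts):
            try:
                service.urlNotifications().publish(
                    body={"url": url, "type": "URL_UPDATED"}
                ).execute()
                published.append(url)
                break
            except Exception as exc:  # production code should catch HttpError only
                if attempt == max_attempts - 1:
                    failed[url] = str(exc)  # give up and record the reason
                else:
                    sleep(base_delay * 2 ** attempt)  # wait 1s, 2s, 4s, ...
    return published, failed
```

Returning the failures instead of raising lets a scheduled job finish the batch and report the problem URLs at the end.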
Google Custom Search JSON API (Deprecation)
- Status: Not available for new customers after January 1, 2027 (announced February 2026).
- Free tier: 100 queries/day; additional $5 per 1,000 queries; max 10,000 queries/day.
- Limitations: max 10 results per request, max 100 total results per query (start parameter cannot exceed 91). Results are cached and may be out of sync with real‑time index. This is NOT a replacement for Google Web Search (no organic analytics).
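import requests

# API_KEY, CX, query and page (1-based) are defined elsewhere
params = {"key": API_KEY, "cx": CX, "q": query, "start": (page - 1) * 10 + 1}
data = requests.get("https://www.googleapis.com/customsearch/v1", params=params, timeout=30).json()
search_items = data.get("items", [])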
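Because of the 100-result cap, the start offsets for a single query can be precomputed. A minimal sketch, with start_values as a hypothetical helper name:

```python
def start_values(max_results=100, per_page=10):
    """Return the valid 1-based start offsets for one query."""
    capped = min(max_results, 100)  # results beyond 100 are never returned
    return list(range(1, capped + 1, per_page))
```

With the defaults this yields ten offsets from 1 to 91, which is why a start above 91 is rejected.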
Migration path: Agent Search / Discovery Engine (Vertex AI) is the recommended replacement for site‑specific search.
Agent Search / Discovery Engine API (2026 Alternative)
- Python client: google-cloud-discoveryengine (gRPC).
- Authentication: Service account via Application Default Credentials.
- Features: exact match, AND/OR, pagination (pageSize up to 100), boosting, filtering, user pseudonymization.
- Use cases: build custom site search or an internal knowledge base, not web‑wide SERP data.
PageSpeed Insights & CrUX APIs
PSI API v5
Endpoint: GET https://www.googleapis.com/pagespeedonline/v5/runPagespeed. Authentication: API key only (simplest). Response contains lab data (Lighthouse scores for performance, PWA, accessibility, best practices, SEO) and field data (CrUX real‑world metrics aggregated over 28 days). strategy parameter: mobile or desktop.
CrUX API (Separate Endpoint)
Endpoint: POST https://chromeuxreport.googleapis.com/v1/records:queryRecord?key={API_KEY}. API key only, no OAuth.
payload = {"origin": "https://example.com", "formFactor": "PHONE"}
response = requests.post(
    f"https://chromeuxreport.googleapis.com/v1/records:queryRecord?key={API_KEY}",
    json=payload
)
metrics = response.json()['record']['metrics']
p75_lcp = float(metrics['largest_contentful_paint']['percentiles']['p75'])
- API data is refreshed daily and aggregated over a rolling 28‑day window; the BigQuery dataset is updated monthly.
- Not all origins have sufficient data, so handle KeyError.
- CrUX now offers 25‑week history via API.
- For large batches, use BigQuery CrUX dataset (free via Google Cloud) instead.
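Handling missing data can be isolated in one helper so a batch run never crashes on a low-traffic origin. A minimal sketch, assuming crux_json is the parsed response body; the function name p75 is a hypothetical choice.

```python
def p75(crux_json, metric="largest_contentful_paint"):
    """Return the p75 value for a metric as a float, or None when CrUX has no data."""
    try:
        value = crux_json["record"]["metrics"][metric]["percentiles"]["p75"]
    except (KeyError, TypeError):
        return None  # origin not in the dataset, or metric not reported
    return float(value)  # some metrics arrive as strings, others as numbers
```

Returning None rather than 0 keeps missing origins from skewing averages once the values land in a DataFrame.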
Batch Processing Best Practices
Check your Cloud project's quota page for the current PSI limits; the CrUX API documents a limit of 150 queries per minute per Google Cloud project. Both can throttle an API key, so watch for 429 Too Many Requests. For 2,000+ URLs, use async httpx:
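import asyncio
import httpx

PSI_URL = "https://www.googleapis.com/pagespeedonline/v5/runPagespeed"

async def fetch_psi(url):
    # Lighthouse runs are slow, so allow a generous timeout
    async with httpx.AsyncClient(timeout=60) as client:
        # params= URL-encodes the audited URL; API_KEY must be defined by the caller
        r = await client.get(PSI_URL, params={"url": url, "key": API_KEY})
        r.raise_for_status()
        return r.json()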
Combine PSI and CrUX into a single DataFrame with lab score, field metrics, and push to a dashboard.
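For large batches the number of requests in flight should be capped. The sketch below uses an asyncio.Semaphore for that; gather_bounded is a hypothetical helper, and worker stands for whatever coroutine performs a single request (for example a PSI fetch that takes one URL).

```python
import asyncio

async def gather_bounded(items, worker, limit=10):
    """Run worker(item) for every item with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run(item):
        async with semaphore:
            try:
                return item, await worker(item)
            except Exception as exc:  # keep one bad URL from aborting the batch
                return item, exc

    # gather preserves input order, so results line up with items
    return await asyncio.gather(*(run(item) for item in items))
```

Each result is an (item, value) pair where value is either the worker's return value or the exception it raised, so failures can be filtered and retried in a second pass.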
Robots Exclusion Protocol & Sitemaps
RFC 9309 – The Official Standard
Published September 2022 by M. Koster, G. Illyes (Google), H. Zeller (Google), L. Sassman (Google). Key rules:
- File must be named robots.txt (case‑sensitive), UTF‑8, text/plain MIME type, at top‑level path.
- User‑agent matching: case‑insensitive; * for all; multiple groups merged.
- Allow/Disallow: case‑sensitive; most specific wins (most octets); if equal, Allow wins.
- Special characters: # starts a comment, $ marks end‑of‑pattern, * matches zero or more of any character.
- Percent‑encoding: compare after un‑encoding unreserved chars.
- Caching: should not use cached version older than 24 hours unless unreachable.
- If server returns 500–599, assume complete disallow; if unreachable for 30 days, may assume unavailable or use cached copy.
- Should follow at least five consecutive redirects.
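The longest-match rule and the two wildcards can be sketched in a few lines. This is an illustration only, not a full RFC 9309 parser: it assumes the rules for the matching user-agent group have already been collected, it skips percent-encoding normalization, and is_allowed and _pattern_to_regex are hypothetical helper names.

```python
import re

def _pattern_to_regex(pattern):
    """Translate a robots.txt path pattern: * matches any run, a trailing $ anchors the end."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    body = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))

def is_allowed(path, rules):
    """rules: list of ("allow" | "disallow", pattern) for the matched user-agent group."""
    best_len, allowed = -1, True  # no matching rule means the path is allowed
    for kind, pattern in rules:
        if not pattern:  # an empty pattern matches nothing
            continue
        if _pattern_to_regex(pattern).match(path):
            length = len(pattern.encode("utf-8"))  # specificity is counted in octets
            is_allow = kind.lower() == "allow"
            # Longer pattern wins; on a tie, Allow wins.
            if length > best_len or (length == best_len and is_allow):
                best_len, allowed = length, is_allow
    return allowed
```

For example, with Disallow: /private/ and Allow: /private/public/, a request for /private/public/page is allowed because the Allow pattern is longer.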
Python Parsing
Stdlib urllib.robotparser:
from urllib.robotparser import RobotFileParser
rp = RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()
can_fetch = rp.can_fetch('Googlebot', 'https://example.com/page')
delay = rp.crawl_delay('Googlebot')
Limitation: stdlib parser may not fully support RFC 9309 (e.g., wildcard * in path patterns). Recommended: use Google’s open‑source parser at github.com/google/robotstxt or the robotstxt Python package.
Empirical Compliance (2025 Study)
A Duke University study (IMC ’25) tracked 130 self‑declared bots over 40 days: bots were less likely to comply with stricter directives (arXiv 2505.21733). Additionally, David Hurley (2025) found that 0% of five tested AI agent frameworks check robots.txt before fetching content. The Structured Output Markup (SOM) approach can achieve equal accuracy while consuming 54.7% fewer tokens (Source: Hurley).
Sitemaps Protocol
Standard at sitemaps.org. Python library advertools:
import advertools as adv
df = adv.sitemap_to_df('https://example.com/robots.txt') # parses Sitemap directive
df[['loc', 'lastmod', 'sitemap', 'sitemap_size_mb']].head()
Common Pitfall: advertools may return only the last value for tags with multiple values—verify with manual XML parsing.
GSC Sitemaps API (google-api-python-client) can list, get, submit, and delete sitemaps.
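For the manual XML check mentioned in the pitfall above, the standard library is enough. The sketch below keeps every xhtml:link alternate as a list, which is one common case of a tag with multiple values; parse_sitemap and the output keys are hypothetical names.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "xhtml": "http://www.w3.org/1999/xhtml",
}

def parse_sitemap(xml_text):
    """Return (kind, entries) where kind is 'urlset' or 'sitemapindex'."""
    root = ET.fromstring(xml_text)
    kind = root.tag.rsplit("}", 1)[-1]  # strip the namespace from the root tag
    child = "sm:url" if kind == "urlset" else "sm:sitemap"
    entries = []
    for node in root.findall(child, NS):
        entries.append({
            "loc": (node.findtext("sm:loc", default="", namespaces=NS) or "").strip(),
            "lastmod": node.findtext("sm:lastmod", default=None, namespaces=NS),
            # every hreflang alternate is kept, not just the last one
            "alternates": [link.get("href") for link in node.findall("xhtml:link", NS)],
        })
    return kind, entries
```

Comparing the entry count and alternates from this parser against the advertools DataFrame is a quick way to spot dropped values.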
Structured Data Validation
Official Sources
- Rich Results Test: search.google.com/test/rich-results. Supports JSON‑LD, RDFa, Microdata.
- Schema.org: base vocabulary, but Google supplements with documentation (e.g., Product, Review, FAQ).
- Google Structured Data Guidelines: developers.google.com/search/docs/appearance/structured-data/intro-structured-data.
Rich Results Test Features (2026)
- Two modes: URL (live crawl) or Code (paste snippet).
- User agent: Smartphone (default) or Desktop.
- Result states: green (valid), red (invalid, blocks rich result), orange (optional field missing).
- Crawling section: shows if page is accessible; warns if CSS/JS are blocked.
- Preview: for supported types (Recipe, JobPosting, FAQ).
- Share & History: results saved ~90 days.
Key limitation: The test does NOT verify if structured data matches the visible content of the page—only syntax and required fields.
Python Integration
No official Python client for Rich Results Test; use the Testing Tools REST API or leverage GSC URL Inspection:
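result = response['inspectionResult']  # top-level key of the URL Inspection response
rich = result.get('richResultsResult', {})  # absent when no rich results were detected
verdict = rich.get('verdict')  # e.g. "PASS" or "FAIL"
detected = rich.get('detectedItems', [])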
Practical Validation Workflow
- Crawl site with BeautifulSoup → extract application/ld+json blocks.
- Validate JSON syntax via json.loads().
- Cross‑reference with schema.org required fields (e.g., LegalService requires address).
- Submit batch URLs to GSC URL Inspection API → parse richResultsResult.verdict.
- Log all FAIL verdicts and missing fields.
- Quarterly audit recommended (Source: GavelGrow blog, 2026).
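The first three steps can be sketched as follows. To stay dependency-free, this version swaps BeautifulSoup for the standard library's html.parser; the required mapping is a hypothetical rule set you supply yourself, and nested @graph structures are not unpacked.

```python
import json
from html.parser import HTMLParser

class JsonLdExtractor(HTMLParser):
    """Collect the raw text of every <script type="application/ld+json"> block."""
    def __init__(self):
        super().__init__()
        self.blocks, self._buffer = [], None

    def handle_starttag(self, tag, attrs):
        script_type = (dict(attrs).get("type") or "").lower()
        if tag == "script" and script_type == "application/ld+json":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buffer is not None:
            self.blocks.append("".join(self._buffer))
            self._buffer = None

def audit_json_ld(html, required):
    """required maps an @type to the property names that must be present."""
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    findings = []
    for raw in parser.blocks:
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as exc:
            findings.append({"type": None, "error": f"invalid JSON: {exc.msg}", "missing": []})
            continue
        for item in data if isinstance(data, list) else [data]:
            if not isinstance(item, dict):
                continue
            kind = item.get("@type")
            rules = required.get(kind, []) if isinstance(kind, str) else []
            missing = [prop for prop in rules if prop not in item]
            findings.append({"type": kind, "error": None, "missing": missing})
    return findings
```

Calling audit_json_ld(page_html, {"LegalService": ["name", "address"]}) returns one finding per block, which can be logged alongside the URL Inspection verdict for the same page.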
Authentication Summary
| API | Primary Auth | Alternative | Python Library | Refresh |
|---|---|---|---|---|
| GSC | OAuth 2.0 (web) | Service account + delegation | google-api-python-client | Refresh token (valid 6 months) |
| Indexing API | OAuth 2.0 (service account) | – | google-api-python-client | JWT (1 hour auto‑renew) |
| PSI v5 | API key | OAuth token | requests or client | Key rotation manual |
| CrUX API | API key | – | requests | Key rotation manual |
| Custom Search | API key | – | requests | Key rotation manual |
| URL Inspection | OAuth 2.0 | API key (Testing Tools) | google-api-python-client | Same as GSC |
| Agent Search | Service account (ADC) | – | google-cloud-discoveryengine | Automatic via ADC |
OAuth gotchas:
- Refresh token expires if not used for 6 months.
- Max 100 refresh tokens per client ID per account.
- Authorization codes: 256 bytes; access tokens: 2048 bytes; refresh tokens: 512 bytes.
- Testing OAuth consent screen: refresh token expires in 7 days if user type is external & testing status.
Automation Workflow Blueprint (2026)
Typical Audit Pipeline
- Robots.txt & Sitemap Discovery → urllib.robotparser + advertools.sitemap_to_df().
- Bulk URL Collection → flatten sitemap, deduplicate, optionally filter by lastmod.
- Structured Data Extraction → BeautifulSoup parse <script type="application/ld+json">, validate with Schema.org rules.
- Rich Results Verification → batch GSC URL Inspection (2,000/day) → check richResultsResult.verdict.
- Performance & UX → batch PSI + CrUX (async httpx) → store in DataFrame with p75 scores.
- Search Visibility → GSC Search Analytics query (top queries, pages, device breakdown) → export to CSV / BigQuery.
- Indexing Status → URL Inspection indexStatusResult.coverageState.
- Alerting → compare previous run → if coverage drops >5%, send Slack alert.
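The alerting step comes down to a comparison between two runs. The sketch below assumes "coverage" means the set of URLs reported as indexed and treats the 5% threshold as a relative drop; coverage_alert is a hypothetical helper, and sending the Slack message is left to the caller.

```python
def coverage_alert(previous_indexed, current_indexed, threshold=0.05):
    """Return alert details when the indexed-URL count falls by more than threshold."""
    previous_indexed, current_indexed = set(previous_indexed), set(current_indexed)
    if not previous_indexed:
        return None  # nothing to compare against on the first run
    drop = (len(previous_indexed) - len(current_indexed)) / len(previous_indexed)
    if drop <= threshold:
        return None
    return {
        "drop": round(drop, 4),
        "lost_urls": sorted(previous_indexed - current_indexed),
    }
```

Including the lost URLs in the alert payload saves a second lookup when someone investigates the drop.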
Decision Tree: Sync vs. Async
- < 500 URLs: sync with requests (simple, less code).
- 500–2,000 URLs: async with httpx and asyncio.
- > 2,000 URLs: async + rate limiting; consider splitting runs across multiple days.
Scheduling & Monitoring
- Use GitHub Actions or cron for weekly runs.
- Wrap API calls with tenacity for exponential backoff.
- Log all errors and quota warnings to a separate file.
Frequently Asked Questions
How many URLs can I check per day with GSC URL Inspection?
2,000 queries per site per day. For larger sites, split batches across multiple days or prioritize high‑value pages (e.g., product or article pages).
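Splitting a large URL list into per-day batches can be sketched as below. plan_daily_batches and the priority callback are hypothetical names; the default quota mirrors the 2,000 per site per day figure above.

```python
def plan_daily_batches(urls, daily_quota=2000, priority=None):
    """Deduplicate, order by priority, and split into one batch per day."""
    unique = list(dict.fromkeys(urls))  # dedupe while keeping first-seen order
    if priority is not None:
        # highest score first; the sort is stable, so ties keep their order
        unique.sort(key=priority, reverse=True)
    return [unique[i:i + daily_quota] for i in range(0, len(unique), daily_quota)]
```

Passing priority=lambda u: "/product/" in u puts product pages in the earliest batches, so the most valuable URLs are inspected first.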
How do I handle OAuth refresh token expiry?
Store the refresh token securely and re‑authenticate via the consent screen if it expires. For service accounts, you never deal with refresh tokens: the client library signs a JWT and renews the one‑hour access token automatically.
Should I use service accounts for GSC?
Yes, for unattended scheduled scripts. You need to add the service account as a verified owner of the property and set up domain‑wide delegation in Google Workspace.
How do I parse robots.txt correctly?
Use urllib.robotparser for basic checks, but note it may not fully support RFC 9309 wildcards. For production, use Google’s C++ parser via Python bindings or the robotstxt community package.
Why is GSC API data different from the UI?
The UI truncates at 1,000 rows. The API can return up to 50,000 rows per property per day, revealing hidden long‑tail queries that may drive significant traffic.
What does the Rich Results Test NOT catch?
It only validates syntax and required fields, not whether the structured data matches the visible page content. For that, you need to cross‑reference with the rendered page.
Is it safe to use API keys exposed in client‑side code?
No. API keys are meant for server‑side requests. Store them in environment variables rather than hardcoding them, and never push them to public repos.
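A small loader makes a missing key fail fast with a readable message. This is a minimal sketch; require_env is a hypothetical helper and the variable name is whatever you choose for your deployment.

```python
import os

def require_env(name):
    """Read a secret from the environment and fail with a clear message if it is missing."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Set the {name} environment variable before running the audit")
    return value
```

A script would then start with something like API_KEY = require_env("PSI_API_KEY"), where PSI_API_KEY is an illustrative name.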
Can I use the Custom Search API after January 2027?
Only if you already have an active project. New projects cannot acquire API keys after that date. Migrate to Agent Search for new site search features.
How do I stay updated on API changes?
Monitor Google Developers changelogs, subscribe to Search Central blog, and check the relevant GitHub repos for library updates.
Final Notes
Python for SEO audits in 2026 is API‑first, async, and increasingly AI‑assisted. The most valuable skill is not writing code snippets—it’s understanding the official documentation, respecting rate limits, and designing automation that doesn’t break when Google deprecates an endpoint. Always test with small samples, log aggressively, and prefer service accounts for production scripts.
For deeper dives into specific topics, explore the SEO1 Library guides on technical SEO audit automation and search API best practices.
Originally published in the EcomExperts SEO library.