A sponsored article titled "Online discovery has changed. Has your brand?" recently surfaced behind a hard block on News Corp Australia’s network, serving up not a branded content piece but a terse notice: access denied, courtesy of the publisher’s traffic‑management software. The irony is hard to miss. A story about how brands get discovered in the age of AI was itself walled off from automated access, the very mechanism driving modern discovery. This small incident lays bare a collision between two forces reshaping the web: publishers fighting to control who scrapes their content, and brands depending on that same scraping to surface in AI‑generated search results, chatbots, and recommendations.
The message was clear but incomplete. It didn’t say whether a bot had been blocked or a human had hit a paywall. For the vast number of automated crawlers scouring the internet every day, from Google’s web crawler to OpenAI’s GPTBot to countless data‑harvesting scripts, the result is the same: a wall. And behind that wall, a sponsored article meant to educate brands on modern discovery tactics sits unreadable, joining the billions of pages that search engines and AI models can’t see.
When publishers pull up the drawbridge
News Corp Australia’s block is part of a wider trend. Publishers, long squeezed by declining ad revenue and the loss of referral traffic to AI answer engines, are aggressively deploying bot‑management firewalls. Tools like Cloudflare Bot Management, Akamai, and custom in‑house solutions distinguish "good bots" (Googlebot, Bingbot) from "bad bots" (content scrapers, AI training crawlers, price aggregators) in real time. The reasoning is straightforward: proprietary journalism and premium content are valuable assets; letting them be vacuumed up for free by AI models undermines paywalls and licensing deals.
This isn’t paranoia. Since the launch of ChatGPT, web scraping by AI training bots has exploded. OpenAI’s GPTBot, Anthropic’s ClaudeBot, and even the crawlers for perplexity.ai and you.com operate at scale, and not every automated agent honors robots.txt directives. The Internet Archive’s Wayback Machine shows that top publishers’ robots.txt files now bristle with explicit disallows for AI crawlers, a practice virtually unheard of before 2023. When a major publisher like News Corp blocks a request, it’s not always a human being turned away; increasingly, it’s a bot trying to index or learn from that page.
AI discovery: the new search gold rush
Simultaneously, how consumers discover brands is shifting from ten blue links to conversational AI. Microsoft’s Copilot, Google’s AI Overviews, ChatGPT with browsing, and a swarm of emerging AI search engines are synthesizing answers directly from web content, often bypassing the source website entirely. For brands, being mentioned in an AI‑generated answer to "best running shoes" or "what’s a good project management tool?" can be more valuable than a high Google ranking. A June 2024 survey by search-analytics firm Botify found that 41% of marketers now consider AI‑generated search features a top priority for organic visibility, up from 12% a year earlier.
But here’s the rub: AI models can only synthesize information they’ve been able to access. If a publisher’s bot‑blocking rules prevent crawlers from reading that sponsored article on discovery, then an AI assistant asked "How has online discovery changed?" will never see that content, much less cite it. For the brand that paid for the placement, the exposure is effectively zero among the very audience using AI to research, unless a human explicitly copies the link, which the block notice itself prevents.
The two‑sided trap for brands
This creates a dilemma that’s already playing out in boardrooms. On one hand, brands need to protect their own websites from malicious bots that can steal content, distort analytics, or launch credential‑stuffing attacks. On the other, denying access to legitimate AI crawlers means voluntarily removing your brand from the most disruptive distribution channel since the smartphone. Worse, the line between "good" and "bad" bots is blurring. OpenAI’s GPTBot respects robots.txt if configured correctly, but many AI scraping agents use rotating IPs and user‑agent strings to evade detection, making it almost impossible to allow one and block another without sophisticated behavior analysis.
Consider a practical example documented by enterprise SEO platform BrightEdge. A major retailer blocked all bots except Googlebot and Bingbot in early 2024 to protect pricing data. Within three months, the retailer’s brand mentions in ChatGPT responses dropped by 74%, and its share of voice in AI‑generated product recommendations fell to near zero. The retailer’s SEO team had to create a detailed, frequently updated robots.txt policy whitelisting specific AI crawlers and even segmenting access by site section: letting AI index product descriptions while keeping customer reviews behind a wall. The result? Six months later, brand presence in AI responses recovered to within 10% of pre‑block levels.
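A segmented policy of the kind described above can be sketched in a few lines of Python. This is only an illustration, not the retailer's actual configuration: the path prefixes (/products/, /reviews/, /pricing/), the shop.example domain, and the helper names are all assumptions. The check uses Python's standard urllib.robotparser, which applies the first matching rule in a group, so real crawlers that use longest-match semantics may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical policy: crawler token -> (allowed prefixes, blocked prefixes).
POLICY = {
    "Googlebot": (["/"], []),
    "Bingbot": (["/"], []),
    "GPTBot": (["/products/"], ["/reviews/", "/pricing/"]),
    "PerplexityBot": (["/products/"], ["/reviews/", "/pricing/"]),
}


def build_robots_txt(policy):
    """Render one robots.txt group per crawler, then block everything else."""
    lines = []
    for agent, (allowed, blocked) in policy.items():
        lines.append(f"User-agent: {agent}")
        lines.extend(f"Allow: {prefix}" for prefix in allowed)
        lines.extend(f"Disallow: {prefix}" for prefix in blocked)
        lines.append("")  # a blank line closes the group
    lines += ["User-agent: *", "Disallow: /"]
    return "\n".join(lines) + "\n"


def can_fetch(robots_txt, agent, url):
    """Ask the standard-library parser whether `agent` may fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


robots_txt = build_robots_txt(POLICY)
print(robots_txt)
print(can_fetch(robots_txt, "GPTBot", "https://shop.example/products/shoe-1"))
print(can_fetch(robots_txt, "GPTBot", "https://shop.example/reviews/shoe-1"))
```

Keeping the policy in one data structure and generating the file from it makes the "frequently updated" part less error-prone than hand-editing, though robots.txt remains advisory and only constrains crawlers that choose to obey it.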
Inside the News Corp block: what the notice reveals
Back to that blocked article. The access‑denied page is minimal: a short message, a reference code, and a perfunctory apology. But it reveals more than it intends. The URL suggests a sponsored content hub under www.news.com.au/... and the notice is served by a traffic management platform that likely uses JavaScript challenges or device fingerprinting to differentiate bots from humans. That means even "good" AI crawlers, which typically cannot execute JavaScript, will likely receive the block. Google’s crawler can render pages, but it can still be stopped by a challenge built to screen out automated clients unless it is allowlisted; OpenAI’s GPTBot has limited JavaScript capabilities. The upshot: an article about discovery in the AI age is, in all likelihood, invisible to AI.
If the article contains any proprietary data, quotes from industry experts, or unique branding insights, none of that will feed into models that could later influence purchasing decisions. This is the paradox publishers now face: selling sponsored content that is technically readable by humans but effectively cloaked from the machines that shape human attention. Unless the publisher offers a separate, AI‑readable feed—some do, via APIs or structured data—the value of such placements will increasingly be questioned.
What brands can do right now
While the landscape is complex, practical steps exist. First, audit your own site’s robots.txt and bot‑management rules. Ensure you’re not unwittingly blocking the very crawlers that feed AI search features. The table below lists some prominent AI crawlers and their official user‑agent tokens as of early 2025:
| Crawler | User‑Agent Token | Purpose |
|---|---|---|
| Google‑Extended | Google-Extended | Controls whether content may be used for Google’s generative AI features (Gemini, formerly Bard, and Vertex AI); a robots.txt token rather than a separate crawler |
| OpenAI GPTBot | GPTBot | Crawls web for ChatGPT model training & browsing |
| Anthropic ClaudeBot | ClaudeBot | Crawls for Claude model training & web retrieval |
| CommonCrawl CCBot | CCBot | Widely used dataset for many AI training efforts |
| PerplexityBot | PerplexityBot | Powers Perplexity AI search engine |
| Meta‑AI | Meta-ExternalAgent | Crawls for Meta AI assistant training |
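The audit suggested above can be approximated with Python's standard urllib.robotparser. The sketch below is a starting point rather than a complete audit: the sample robots.txt is invented for illustration, the function name is an assumption, and the parser only reports what robots.txt says, not what a firewall or bot‑management layer does on top of it.

```python
from urllib.robotparser import RobotFileParser

# Tokens taken from the table above.
AI_CRAWLER_TOKENS = [
    "Google-Extended",
    "GPTBot",
    "ClaudeBot",
    "CCBot",
    "PerplexityBot",
    "Meta-ExternalAgent",
]


def audit_robots_txt(robots_txt, path="/"):
    """Return {token: True/False} for whether each AI crawler may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {token: parser.can_fetch(token, path) for token in AI_CRAWLER_TOKENS}


# Invented sample: two AI crawlers are blocked site-wide, everyone else
# is only kept out of /private/.
SAMPLE_ROBOTS_TXT = """\
User-agent: GPTBot
User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /private/
"""

report = audit_robots_txt(SAMPLE_ROBOTS_TXT, "/blog/post")
blocked = sorted(token for token, allowed in report.items() if not allowed)
print(blocked)  # ['CCBot', 'GPTBot']
```

To run this against a live site, the same parser offers set_url() and read() to download the file; that is left out here so the example works offline.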
To remain discoverable while maintaining control, implement granular rules. For example:
- Allow harmless bots selectively: a robots.txt group such as User-agent: GPTBot followed by Disallow: /private/ grants access only to non‑sensitive directories.
- Use a noai robots meta tag, for example <meta name="robots" content="noai">, or the equivalent X-Robots-Tag: noai response header, on individual pages to opt out of AI usage without blocking the crawler entirely. Note that noai is a non‑standard directive, so crawlers are under no obligation to honor it.
- Leverage structured data (Schema.org) to explicitly communicate your brand’s facts, products, and services in a machine‑readable format, increasing the chance that AI models will ingest correct, curated information even if the page is partially blocked.
- Consider participating in licensing deals. Publishers like News Corp have signed multi‑year agreements with OpenAI to allow content access in exchange for payments—though this applies to editorial content, not necessarily sponsored posts.
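The structured‑data suggestion in the list above can be made concrete with a small JSON‑LD generator. This is a minimal sketch: the product, brand, and URL are placeholders, the helper names are assumptions, and only a handful of Schema.org Product properties are shown. Whether a given AI system actually reads such markup is up to that system.

```python
import json


def product_jsonld(name, brand, description, url):
    """Build a minimal Schema.org Product description as a plain dict."""
    return {
        "@context": "https://schema.org",
        "@type": "Product",
        "name": name,
        "brand": {"@type": "Brand", "name": brand},
        "description": description,
        "url": url,
    }


def as_script_tag(data):
    """Serialize the dict into a JSON-LD <script> tag for the page <head>."""
    payload = json.dumps(data, indent=2)
    # Escape "</" so a stray "</script>" inside a value cannot close the tag.
    payload = payload.replace("</", "<\\/")
    return f'<script type="application/ld+json">\n{payload}\n</script>'


# Placeholder values for illustration only.
example = product_jsonld(
    name="Trail Runner 2",
    brand="Acme",
    description="Lightweight trail running shoe.",
    url="https://shop.example/products/trail-runner-2",
)
print(as_script_tag(example))
```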
Second, monitor your brand’s presence in AI‑generated answers. Tools are emerging to track this. For instance, platforms like Alby, Botify, and Conductor now offer “AI visibility” scores that simulate how often a brand appears in responses from major language models. Unlike traditional rank tracking, these metrics require constant re‑evaluation as models update and as publishers change blocking rules. A sudden drop could indicate your site has been inadvertently blocked from a new AI crawler.
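How the commercial tools compute their scores is not public, but the underlying idea can be sketched simply: collect a set of AI responses to fixed prompts, measure how many mention the brand, and flag a sharp fall between runs. Everything below is an assumption made for illustration, including the function names, the 50% drop threshold, and the sample responses; collecting the responses themselves is out of scope.

```python
import re


def mention_rate(responses, brand):
    """Fraction of responses that mention `brand` as a whole word (case-insensitive)."""
    if not responses:
        return 0.0
    pattern = re.compile(rf"\b{re.escape(brand)}\b", re.IGNORECASE)
    hits = sum(1 for text in responses if pattern.search(text))
    return hits / len(responses)


def dropped_sharply(previous_rate, current_rate, threshold=0.5):
    """True if the rate fell by at least `threshold` (relative) since the last run."""
    if previous_rate == 0:
        return False  # nothing to lose, so no drop to report
    return (previous_rate - current_rate) / previous_rate >= threshold


# Invented sample responses to a prompt such as "best running shoes".
last_month = [
    "Try Acme or Globex.",
    "Globex is popular.",
    "acme makes trail shoes.",
    "No clear winner.",
]
this_month = [
    "Globex is popular.",
    "Initech has a new model.",
    "No clear winner.",
    "Globex again.",
]

before = mention_rate(last_month, "Acme")
after = mention_rate(this_month, "Acme")
print(before, after, dropped_sharply(before, after))
```

A flagged drop is a prompt to re‑check crawler access, not proof of a block; model updates alone can move these numbers.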
The publisher’s perspective
Publishers are not the villains here. Their core business—quality journalism—is expensive to produce, and they’re rightly wary of having it distilled into snippets by AI that never sends a visitor. The News Corp block likely stems from a blanket policy: protect the website from automated access unless it’s a whitelisted partner. This is especially critical for a site that routinely runs sponsored content; the last thing they want is for an AI to scrape the article and present the brand message without the surrounding ad context, depriving the publisher of user data and return visits.
However, a more nuanced solution is on the horizon. Content‑delivery networks and bot‑management vendors are adding rules that can differentiate AI crawlers from others based on behavior rather than user‑agent alone. For example, a bot that requests only HTML, never fetches images or CSS, and does so at high speed might be flagged as an AI scraper. But some of these behaviors overlap with those of legitimate crawlers, so such rules risk false positives. The industry is working towards a “good AI bot” standard, perhaps with verified signatures, akin to the reverse‑DNS and published‑IP verification that Google’s web crawlers already support.
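The behavioral rule described above (HTML only, no images or CSS, high request rate) can be expressed as a toy heuristic. This is a sketch of the idea, not how any vendor implements it: the Request record, the thresholds, and the asset suffix list are all assumptions, and a real system would combine many more signals.

```python
from dataclasses import dataclass


@dataclass
class Request:
    client_id: str    # hypothetical per-client key, e.g. IP plus fingerprint
    timestamp: float  # seconds since some reference point
    path: str


# Simplistic: ignores query strings and assets served without an extension.
ASSET_SUFFIXES = (".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp", ".woff2")


def looks_like_scraper(requests, min_requests=20, max_rate_per_sec=2.0):
    """Flag one client's requests if they are HTML-only and arrive at high speed."""
    if len(requests) < min_requests:
        return False  # too little traffic to judge
    fetched_assets = any(r.path.lower().endswith(ASSET_SUFFIXES) for r in requests)
    times = [r.timestamp for r in requests]
    duration = max(times) - min(times)
    rate = len(requests) / duration if duration > 0 else float("inf")
    return (not fetched_assets) and rate > max_rate_per_sec


# 30 article pages in about three seconds, with no assets requested.
html_only = [Request("client-a", i * 0.1, f"/article/{i}") for i in range(30)]
print(looks_like_scraper(html_only))
```

The overlap problem is visible even here: a legitimate search crawler fetching HTML quickly would trip the same rule, which is why verification of crawler identity matters alongside behavior.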
A coming revolution in discovery economics
This standoff will force a rethinking of how brands pay for online visibility. The sponsored‑content model—pay a publisher to host a branded article and hope it gets found—is increasingly fragile if AI can’t see it and human traffic comes via AI‑mediated routes that skip the publisher’s site. Already we’re seeing brands shift budget to featured snippets, structured data optimization, and direct partnerships with AI platforms. Microsoft, for example, allows brands to claim their profiles in Copilot for certain categories, ensuring accurate info surfaces. Google’s AI Overviews can pull from merchant feeds and business profiles. The new SEO is becoming AI‑Optimization, or AIO.
For the publisher that blocked the discovery article, the immediate fix might be simple: create a whitelist rule in their bot‑management tool that allows GPTBot and PerplexityBot to access the sponsored‑content directory. Or offer a plain text version of the article with clear licensing terms for AI models. The challenge is scaling this decision across thousands of pages while maintaining a profit margin.
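In spirit, such a rule is a path‑scoped allowlist. The sketch below shows the decision logic only; it is not the configuration syntax of any particular bot‑management product, and the /sponsored/ prefix and the function name are assumptions. Because user‑agent strings are trivially spoofed, a production rule would also verify the crawler against the IP ranges its operator publishes.

```python
ALLOWED_CRAWLER_TOKENS = ("GPTBot", "PerplexityBot")
SPONSORED_PREFIX = "/sponsored/"  # hypothetical directory name


def decide(user_agent, path):
    """Return "allow" for allowlisted crawlers on sponsored pages, else "default_checks"."""
    ua = user_agent.lower()
    is_allowed_crawler = any(token.lower() in ua for token in ALLOWED_CRAWLER_TOKENS)
    if is_allowed_crawler and path.startswith(SPONSORED_PREFIX):
        return "allow"
    # Everything else falls through to the existing bot-management checks.
    return "default_checks"


# Illustrative user-agent string; the exact format may differ in practice.
gptbot_ua = "Mozilla/5.0 (compatible; GPTBot/1.0; +https://openai.com/gptbot)"
print(decide(gptbot_ua, "/sponsored/online-discovery"))
print(decide(gptbot_ua, "/news/politics"))
```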
Conclusion: the open web at a crossroads
The little access‑denied notice from News Corp Australia is more than a footnote. It’s a symptom of the web’s evolving architecture, where gates and gatekeepers determine which information flows into the AI systems that increasingly answer our questions. For brands, ignorance is no longer an option. The choice isn’t between blocking bots or surrendering content; it’s about intelligently managing access in a way that protects assets while embracing the AI‑powered discovery era. Every robots.txt line, every bot rule, every licensing agreement is now a strategic decision that directly shapes a brand’s digital footprint in a world where discovery has indeed changed. The question the blocked article provocatively posed isn’t just a marketing slogan. It’s a real‑world test of how well businesses understand the new mechanics of being found online.