The little-known reason why competing with Google is so hard

Before a new search engine can hope to make a run against Google, it has to crawl. But indexing the web by “crawling” sites with automated software doesn’t just require scaling up to the web’s vast scope, though doing so is a big challenge in itself. Individual sites have no obligation to welcome a new search crawler. Some instead post digital no-trespassing signs, a way to discourage automated traffic that might bog down performance.

“The web has trillions of documents,” says Vivek Raghunathan, cofounder of the ad-free, subscription-based search startup Neeva. “And the web is a lot trickier to crawl than it was a few years ago.”

An October 2020 report on digital competition by the House Judiciary Committee’s Subcommittee on Antitrust aimed a government spotlight at this situation. “The high cost of maintaining a fresh index, and the decision by many large webpages to block most crawlers, significantly limits new search engine entrants,” the report stated. “Today, the only English-language search engines that maintain their own comprehensive webpage index are Google and Bing.”

That leaves many Google competitors renting the index Microsoft maintains for its Bing search, which has 6.4% of the U.S. market in Statcounter’s measurements, compared to Google’s 87.3%. Bing’s index works well for many queries, but sites leaning on it cede a key way to differentiate themselves.

That’s an issue for Neeva as well as two other privacy-centric search engines, DuckDuckGo and Brave. All three call on Bing for some of the results they provide to users. It’s just one ingredient rather than the entirety of their technology, but still: It would be easier to do without it if creating a new index of the web weren’t so hard.

Robots not welcome here

Websites control automated access to their pages using standardized “robots.txt” files enumerating where crawlers may go.
Crawlers can disregard these instructions, as the Internet Archive began doing in 2017, to improve its backup of the web. But sites can punish a pushy robot by blocking its access.

DuckDuckGo and Neeva pointed to Facebook’s platform as one example. Its robots.txt file takes a guest-list approach, approving Google and Bing as well as such less obvious crawlers as “Applebot,” which gathers data for Apple’s Siri and Spotlight. But it excludes all bots not cited by name.

Jason Grosse, a spokesperson for Facebook’s parent firm Meta, said in an email: “Generally speaking, our robots.txt policy is not out of line with other major platforms.”

Indexing sites that don’t appreciate a new crawler’s attention can demand discretion and diplomacy. “A lot of the work we’ve done in the last year, year and a half, is building a crawler system that is well behaved,” said Neeva’s Raghunathan. “We do things like smart algorithmic estimation of how much can we crawl this site so it looks like a rounding error.”

Sometimes, however, Neeva has to ask for help. From whom? “I’d say it’s been the first person we know, and often the first person we know is the CEO or the head of engineering.”

Brave, meanwhile, operates in a stealth mode by varying its crawler’s identification and only abiding by whatever restrictions a robots.txt file places on Google’s crawler. Josep M. Pujol, chief of search at Brave, founded by Mozilla cofounder Brendan Eich and better known for its privacy-focused browser, said in an email that this requires treading lightly. “We respect the spirit of the law but not the letter,” he said.
“As of today, the data centers that host our crawlers have received a very small number of complaints.”

Pujol called asking individual sites’ permission impractical: “How do you scale human interaction to thousands of companies?”

Google, meanwhile, can get another leg up because its nonsearch lines of business, starting with display ads but including services like Google Analytics, require access to sites that competitors can only request, said Zack Maril, a software engineer and founder of a search-competition group called Knuckleheads’ Club. These other ventures, he wrote in an email, “all can benefit from Google’s search business in various ways that other competitors running only search engines simply cannot compete on.”

Search sites without Google- or Bing-level traffic also lack large-scale metrics about which sites are more or less popular. Google and Bing “can look at everything that people liked, and prioritize all the clicks from there,” says Raghunathan. “When you’re bootstrapping, it’s a lot harder.”

A report on digital competition, published in July 2020 by the U.K.’s Competition and Markets Authority, suggested requiring Google to provide some of these metrics. As DuckDuckGo communications vice president Kamyl Bazbaz approvingly phrased it, “Share a certain amount of click-and-query data that other search engines could use to level the playing field.” Brave invites itself to a form of that sharing when it asks its users to allow “Google fallback mixing,” in which Brave sends along a query to Google and then analyzes the results to improve its index.

Even a search site that excels at providing web results will struggle to match Google’s full-spectrum information retrieval. For example, I’ve had DuckDuckGo as the default on my iPad Mini for years, but its maps results only cover driving and walking, so I still find myself turning to Apple Maps and Google Maps.
Despite the inherent challenges of competing with Google in search, the fact that new firms are still willing to try speaks well of the stubbornness that these upstarts will need. “We love that there are lots of other search competitors now,” said DuckDuckGo’s Bazbaz. “It’s a market that, historically, people have been really afraid of—and for good reason—because of the way that Google has dominated it.”
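
For readers curious how the guest-list style of robots.txt described earlier plays out for a crawler, here is a minimal sketch using Python's standard urllib.robotparser module. The robots.txt text, the example.com URLs, the bot name "NewSearchBot," and the helper function names are all made up for illustration; this is not the content of Facebook's or any other real site's file, and the last helper is only one assumed reading of a policy of following whatever rules a site sets for Google's crawler.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt in the guest-list style: a few crawlers are named
# and admitted, and every bot not cited by name is shut out.
# This is an illustration, not the contents of any real site's file.
GUEST_LIST_ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: Bingbot
Disallow: /private/

User-agent: Applebot
Disallow: /private/

User-agent: *
Disallow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def may_crawl(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


def may_crawl_under_google_rules(parser: RobotFileParser, url: str) -> bool:
    """Assumed reading of a 'follow Google's restrictions' policy:
    ignore the crawler's own name and apply the rules written for Googlebot."""
    return parser.can_fetch("Googlebot", url)


if __name__ == "__main__":
    rules = build_parser(GUEST_LIST_ROBOTS_TXT)
    page = "https://example.com/photos/1"
    for bot in ("Googlebot", "Applebot", "NewSearchBot"):
        print(bot, may_crawl(rules, bot, page))
    print("Google's rules:", may_crawl_under_google_rules(rules, page))
```

Under these made-up rules, the named crawlers may fetch the sample page while the unnamed newcomer is refused everywhere, which is the bind a new search engine faces when a large site publishes a file like this.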

https://www.fastcompany.com/90709672/the-little-known-reason-why-competing-with-google-is-so-hard?partner=rss&utm_source=rss&utm_medium=feed&utm_campaign=rss+fastcompany&utm_content=rss

Created 4y ago | 7. 1. 2022, 11:21:11

