There are many reasons you might need to find all of the URLs on a website, but your exact goal will determine what you’re looking for. For example, you may want to:
Identify every indexed URL to analyze issues like cannibalization or index bloat
Collect current and historic URLs Google has seen, especially for site migrations
Find all 404 URLs to recover from post-migration errors
In each scenario, a single tool won’t give you everything you need. Unfortunately, Google Search Console isn’t exhaustive, and a “site:example.com” search is limited and difficult to extract data from.
In this post, I’ll walk you through some tools to build your URL list before deduplicating the data using a spreadsheet or Jupyter Notebook, depending on your site’s size.
Old sitemaps and crawl exports
If you’re looking for URLs that recently disappeared from the live site, there’s a chance someone on your team may have saved a sitemap file or a crawl export before the changes were made. If you haven’t already, check for these files; they can often provide what you need. But if you’re reading this, you probably didn’t get so lucky.
Archive.org
Archive.org is a useful tool for SEO tasks, funded by donations. If you search for a domain and select the “URLs” option, you can access up to 10,000 listed URLs.
However, there are a few limitations:
URL limit: You can only retrieve up to 10,000 URLs, which is insufficient for larger sites.
Quality: Many URLs may be malformed or reference resource files (e.g., images or scripts).
No export option: There isn’t a built-in way to export the list.
To get around the lack of an export button, use a browser scraping plugin like Dataminer.io. However, these limits mean Archive.org may not offer a complete solution for larger sites. Also, Archive.org doesn’t indicate whether Google indexed a URL, but if Archive.org found it, there’s a good chance Google did, too.
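If you’d rather pull this data programmatically than scrape the page, Archive.org also exposes a CDX API that returns captured URLs as plain text. Here’s a minimal sketch in Python (the domain is a placeholder, and the requests library is assumed to be installed):

```python
import requests

# Ask the Wayback Machine's CDX API for every URL it has captured
# for a domain. "example.com" is a placeholder; use your own site.
resp = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={
        "url": "example.com",
        "matchType": "domain",  # include subdomains
        "fl": "original",       # return only the original URL field
        "collapse": "urlkey",   # deduplicate repeated captures
        "limit": 10000,
    },
    timeout=60,
)
resp.raise_for_status()
urls = resp.text.splitlines()
print(f"Retrieved {len(urls)} URLs")
```

Expect the same quality caveats as the web interface: the output will include resource files and malformed URLs that you’ll want to filter out later.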
Moz Pro
While you’d typically use a backlink index to find external sites linking to you, these tools also discover URLs on your own site in the process.
How to use it:
Export your inbound links in Moz Pro to get a quick and easy list of target URLs from your site. If you’re dealing with a massive website, consider using the Moz API to export data beyond what’s manageable in Excel or Google Sheets.
It’s important to note that Moz Pro doesn’t confirm whether URLs are indexed or discovered by Google. However, since most sites apply the same robots.txt rules to Moz’s bots as they do to Google’s, this method generally works well as a proxy for Googlebot’s discoverability.
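If the export is too big to eyeball, a few lines of pandas will reduce it to a clean URL list. A minimal sketch, assuming a CSV export with a column named “Target URL” (filenames and column names vary by export, so adjust them to match yours):

```python
import pandas as pd

# "moz_inlinks_export.csv" and the "Target URL" column name are
# assumptions; rename them to match your actual export.
df = pd.read_csv("moz_inlinks_export.csv")
target_urls = (
    df["Target URL"]
    .dropna()
    .str.strip()
    .drop_duplicates()
    .sort_values()
)
target_urls.to_csv("urls_from_backlinks.csv", index=False)
print(f"{len(target_urls)} unique target URLs")
```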
Google Search Console
Google Search Console offers several valuable sources for building your list of URLs.
Links reports:
Similar to Moz Pro, the Links section provides exportable lists of target URLs. Unfortunately, these exports are capped at 1,000 URLs each. You can apply filters for specific pages, but since filters don’t carry over to the export, you may have to rely on browser scraping tools, which are limited to 500 filtered URLs at a time. Not ideal.
Performance → Search Results:
This export gives you a list of pages receiving search impressions. While the export is limited, you can use the Google Search Console API for larger datasets, as sketched below. There are also free Google Sheets plugins that simplify pulling more extensive data.
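For larger properties, the Search Analytics endpoint of the Search Console API can page through far more rows than the UI export allows. A sketch using the google-api-python-client library, assuming a service account with access to the property (the key file path, dates, and site URL are placeholders):

```python
from google.oauth2 import service_account
from googleapiclient.discovery import build

# Placeholder credentials file and site URL; adjust to your setup.
creds = service_account.Credentials.from_service_account_file(
    "service_account.json",
    scopes=["https://www.googleapis.com/auth/webmasters.readonly"],
)
service = build("searchconsole", "v1", credentials=creds)

pages, start_row = [], 0
while True:
    response = service.searchanalytics().query(
        siteUrl="https://www.example.com/",
        body={
            "startDate": "2024-01-01",
            "endDate": "2024-03-31",
            "dimensions": ["page"],
            "rowLimit": 25000,      # API maximum per request
            "startRow": start_row,  # paginate past the UI export cap
        },
    ).execute()
    rows = response.get("rows", [])
    pages += [row["keys"][0] for row in rows]
    if len(rows) < 25000:
        break
    start_row += 25000

print(f"Collected {len(pages)} pages with impressions")
```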
Indexing → Pages report:
This section offers exports filtered by issue type, though these are also limited in scope.
Google Analytics
The Engagement → Pages and screens default report in GA4 is an excellent source for collecting URLs, with a generous limit of 100,000 URLs.
Even better, you can apply filters to create different URL lists, effectively working around the 100k limit. For example, if you want to export only blog URLs, follow these steps:
Step 1: Add a segment to the report
Step 2: Click “Create a new segment.”
Step 3: Define the segment with a narrower URL pattern, such as URLs containing /blog/
Note: URLs found in Google Analytics may not be discoverable by Googlebot or indexed by Google, but they provide valuable insights.
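The same segmented pull can be scripted against the GA4 Data API via the google-analytics-data Python package, which avoids the UI export limits entirely. A sketch, assuming Application Default Credentials and a placeholder property ID:

```python
from google.analytics.data_v1beta import BetaAnalyticsDataClient
from google.analytics.data_v1beta.types import (
    DateRange,
    Dimension,
    Filter,
    FilterExpression,
    Metric,
    RunReportRequest,
)

# The property ID, date range, and /blog/ pattern are placeholders;
# credentials are read from GOOGLE_APPLICATION_CREDENTIALS.
client = BetaAnalyticsDataClient()
request = RunReportRequest(
    property="properties/123456789",
    date_ranges=[DateRange(start_date="2024-01-01", end_date="2024-03-31")],
    dimensions=[Dimension(name="pagePath")],
    metrics=[Metric(name="screenPageViews")],
    dimension_filter=FilterExpression(
        filter=Filter(
            field_name="pagePath",
            string_filter=Filter.StringFilter(
                match_type=Filter.StringFilter.MatchType.CONTAINS,
                value="/blog/",
            ),
        )
    ),
    limit=100000,
)
response = client.run_report(request)
blog_paths = [row.dimension_values[0].value for row in response.rows]
print(f"Found {len(blog_paths)} blog paths")
```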
Server log files
Server or CDN log files are perhaps the ultimate tool at your disposal. These logs capture an exhaustive list of every URL path requested by users, Googlebot, or other bots during the recorded period.
Things to consider:
Data size: Log files can be massive, so many sites only retain the last two months of data.
Complexity: Analyzing log files can be challenging, but several tools are available to simplify the process, and even a short script can go a long way, as sketched below.
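If your logs use the common combined format, a few lines of Python can extract every requested path and flag which ones Googlebot asked for. A sketch (the filename, regex, and user-agent check are illustrative; real logs vary, and user-agent strings can be spoofed):

```python
import gzip
import re

# Pull the request path out of each line of a combined-format access
# log. "access.log.gz" is a placeholder filename.
LINE = re.compile(r'"(?:GET|POST|HEAD) (\S+) HTTP/[\d.]+"')

paths, googlebot_paths = set(), set()
with gzip.open("access.log.gz", "rt", errors="replace") as f:
    for line in f:
        m = LINE.search(line)
        if not m:
            continue
        paths.add(m.group(1))
        # Naive check: anything claiming to be Googlebot counts here.
        if "Googlebot" in line:
            googlebot_paths.add(m.group(1))

print(f"{len(paths)} unique paths, {len(googlebot_paths)} requested by Googlebot")
```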
Combine, and good luck
Once you’ve gathered URLs from all these sources, it’s time to combine them. If your site is small enough, use Excel or, for larger datasets, tools like Google Sheets or Jupyter Notebook. Ensure all URLs are consistently formatted, then deduplicate the list.
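For anything beyond spreadsheet scale, this combine-and-dedupe step is a few lines of pandas in a Jupyter Notebook. A sketch, assuming each export keeps its URLs in the first column (all filenames are placeholders):

```python
import pandas as pd

# Filenames stand in for the exports gathered above; the URL is
# assumed to be in each file's first column, so adjust as needed.
frames = []
for name in ["archive_org.csv", "moz_links.csv", "gsc_pages.csv",
             "ga4_paths.csv", "log_paths.csv"]:
    df = pd.read_csv(name)
    frames.append(df.iloc[:, 0].rename("url"))

urls = pd.concat(frames).dropna().astype(str).str.strip()

# Normalize formatting so duplicates actually match: strip fragments
# and trailing slashes (adjust if those are meaningful on your site).
urls = (
    urls.str.replace(r"#.*$", "", regex=True)
        .str.rstrip("/")
        .drop_duplicates()
        .sort_values()
)
urls.to_csv("all_urls.csv", index=False)
```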
And voilà: you now have a comprehensive list of current, old, and archived URLs. Good luck!