Relative URL / URI Paths for Beginners

I am seeing an increasing number of web crawling bots whose handling of relative paths in hyperlinks is obviously broken. It seems that the people who implemented those bots have no understanding of how relative paths are supposed to work, or they just suck at programming.

The worst example is GPTBot/1.2. One would expect at least a basic level of competence from the people who wrote that bot, especially since ChatGPT itself, when asked about this, will clearly explain the same principles as listed below. I can only hope that all the crap I see in my logs bearing this bot's signature is actually from an imposter that also happens to run from Microsoft cloud servers, but somehow I doubt this is the case.

Here is a refresher on how relative URLs like ./ or ../somedir/somepage.html actually work.

It should be common knowledge that when a fetched URL results in a redirect to a different URL, and the page at that new location contains relative paths, then those paths must be resolved relative to the path of the redirected page, not to the original URL that produced the redirect.

The same goes for files referred to from within another file. For instance, a webpage links to a CSS file that contains paths to background images. If those paths are relative, they must be resolved relative to the URL of the CSS file, not of the page that links to the CSS file.
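This can be sketched with Python's standard urljoin (the URLs here are made-up placeholders, in the same spirit as the examples below):

```python
from urllib.parse import urljoin

# The page links to a stylesheet in a subdirectory:
page_url = "https://www.stuffthings.com/somepage.html"
css_url = urljoin(page_url, "styles/main.css")
# css_url is "https://www.stuffthings.com/styles/main.css"

# Suppose main.css contains: background-image: url(../images/bg.png)
# The base for that relative path is the CSS file, not the page:
image_url = urljoin(css_url, "../images/bg.png")
# image_url is "https://www.stuffthings.com/images/bg.png"
```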

Also: whether the base URL ends in a slash matters. A relative link page.html resolves against https://www.stuffthings.com/dir/ as https://www.stuffthings.com/dir/page.html, but against https://www.stuffthings.com/dir (no trailing slash) as https://www.stuffthings.com/page.html, because the last path segment is treated as a file name and dropped.

Examples

  1. A web crawling bot fetches a URL:
    https://www.stuffthings.com/oldpath/
  2. and the server responds with a 301 Moved Permanently, with header:
    Location: https://www.stuffthings.com/newpath/
  3. and the bot fetches the redirected URL:
    https://www.stuffthings.com/newpath/
  4. and this redirected page contains a link:
    <A HREF="images/bla.png">

then the bot must fetch the image at:
https://www.stuffthings.com/newpath/images/bla.png
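The steps above can be checked against Python's standard library, whose urljoin implements the RFC 3986 resolution rules:

```python
from urllib.parse import urljoin

# The URL after following the 301 redirect is the base, not the original:
redirected_url = "https://www.stuffthings.com/newpath/"
link = "images/bla.png"

image_url = urljoin(redirected_url, link)
# image_url is "https://www.stuffthings.com/newpath/images/bla.png"

# Resolving against the original, pre-redirect URL would be wrong:
wrong = urljoin("https://www.stuffthings.com/oldpath/", link)
# wrong is "https://www.stuffthings.com/oldpath/images/bla.png"
```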

As for the other types of links on that same redirected page:

  * <A HREF="/images/bla.png"> (leading slash) resolves against the domain root: https://www.stuffthings.com/images/bla.png
  * <A HREF="../images/bla.png"> goes up one directory: https://www.stuffthings.com/images/bla.png
  * <A HREF="./images/bla.png"> is equivalent to images/bla.png: https://www.stuffthings.com/newpath/images/bla.png
  * An absolute URL like <A HREF="https://elsewhere.com/bla.png"> must be used as-is.
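Those resolution rules can again be verified with urljoin, using the redirected page URL from the example above as the base:

```python
from urllib.parse import urljoin

base = "https://www.stuffthings.com/newpath/"  # the redirected page

root_relative = urljoin(base, "/images/bla.png")
# "https://www.stuffthings.com/images/bla.png" (leading slash: domain root)

parent = urljoin(base, "../images/bla.png")
# "https://www.stuffthings.com/images/bla.png" (one directory up)

same_dir = urljoin(base, "./images/bla.png")
# "https://www.stuffthings.com/newpath/images/bla.png" (same as "images/bla.png")

absolute = urljoin(base, "https://elsewhere.com/bla.png")
# "https://elsewhere.com/bla.png" (absolute URLs are taken as-is)
```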

It is not rocket science. This is how relative paths have worked since the dawn of time. If you do not understand these basic principles, then please do not write a crawler and let it loose on the entire internet.

If you just use the standard path handling routines or libraries of your favourite programming language, you cannot make the mistakes that come with going the idiotic route of implementing your own path handling. For instance in Python one can rely on urllib:

from urllib.parse import urljoin
source_uri = "https://www.somedomain.com/stuff/things.html"
link_uri = "../gizmos/doodads.html"
destination = urljoin(source_uri, link_uri)

# destination is now: "https://www.somedomain.com/gizmos/doodads.html"

Dr. Lex, 2025-06