Relative URL / URI Paths for Beginners

I am seeing an increasing number of web crawling bots whose handling of relative paths in hyperlinks is obviously broken. It seems the people who implemented those bots have no understanding of how relative paths are supposed to work, or they just suck at programming. Or maybe I am witnessing the first signs of A.I. breakdown due to LLMs having been trained on bad A.I. slop generated by other LLMs.

The worst example I have seen pretended to be GPTBot/1.2. One would expect at least a basic level of competence from the people who wrote that bot, especially because ChatGPT itself, when asked about this, will clearly explain the same principles as are listed below. I can only hope that all the crap in my logs that bears this bot's signature actually comes from an imposter that also happens to run from Microsoft cloud servers. However, I haven't seen this one lately; now it is mostly meta-externalagent requesting broken paths.

Here is a refresher on how relative URLs like ./ or ../somedir/somepage.html actually work.

The Very Basics

Anyone writing software or scripts of any kind should know this.

It is debatable whether / actually is a relative or absolute URL; this may vary depending on the context. However, forget about these semantics and just make sure the path handling itself is correct.
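
Whatever one calls it, the handling of a leading / can be checked with a minimal sketch like the one below. The host names and paths are made up for illustration; only urljoin from Python's standard library is assumed.

```python
from urllib.parse import urljoin

# Hypothetical page on which the links occur.
base = "https://www.example.com/docs/guide/intro.html"

# A path starting with "/" replaces the whole path of the base URL,
# but keeps its scheme and host.
root_relative = urljoin(base, "/images/logo.png")
# root_relative is now: "https://www.example.com/images/logo.png"

# A complete URL ignores the base entirely.
absolute = urljoin(base, "https://cdn.example.net/logo.png")
# absolute is now: "https://cdn.example.net/logo.png"
```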

How to Treat Relative URLs on Web Pages

For anyone writing software that performs requests on the internet, it should be common knowledge that if a request is redirected and the page at the redirect target contains relative paths, then those paths must be resolved relative to the URL of the redirected page, not to the original URL that produced the redirect.

The same goes for files referred from within another file. Relative paths must always be resolved against the path of the file in which they occur.
For instance, a webpage may link to a CSS file that contains paths to background images. If those paths are relative, they must be resolved relative to the URL of the CSS file, not of the page that links to the CSS file.
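
The CSS case can be sketched as follows. All URLs and file names here are hypothetical; the point is only which URL serves as the base at each step.

```python
from urllib.parse import urljoin

# Hypothetical page that links to a stylesheet.
page_url = "https://www.example.com/articles/page.html"

# The stylesheet link occurs in the page, so it resolves against the page.
css_url = urljoin(page_url, "../css/style.css")
# css_url is now: "https://www.example.com/css/style.css"

# The image path occurs in the CSS file, so it resolves against the CSS file.
right = urljoin(css_url, "img/bg.png")
# right is now: "https://www.example.com/css/img/bg.png"

# Resolving it against the page instead gives a path that does not exist.
wrong = urljoin(page_url, "img/bg.png")
# wrong is now: "https://www.example.com/articles/img/bg.png"
```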

Examples

Basic path handling
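
As a rough illustration, here is how plain names, ./ and ../ resolve against a base URL, and why a trailing slash on the base matters. The base URLs and file names are invented for this sketch; urljoin is the only real API used.

```python
from urllib.parse import urljoin

# Hypothetical page URL: the last path segment is a file, not a directory.
base = "https://www.example.com/a/b/page.html"

results = {
    "other.html":       urljoin(base, "other.html"),        # same directory
    "./other.html":     urljoin(base, "./other.html"),      # same directory
    "./":               urljoin(base, "./"),                # the directory itself
    "../other.html":    urljoin(base, "../other.html"),     # one level up
    "../../other.html": urljoin(base, "../../other.html"),  # two levels up
}

# Without a trailing slash, the last segment of the base is treated as a
# file name and dropped; with a trailing slash, it is a directory and kept.
without_slash = urljoin("https://www.example.com/a/b", "c.html")
with_slash = urljoin("https://www.example.com/a/b/", "c.html")
```

Here results maps the five links to .../a/b/other.html, .../a/b/other.html, .../a/b/, .../a/other.html and .../other.html respectively, while without_slash ends in /a/c.html and with_slash in /a/b/c.html.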

Redirect example

If:

  1. a web crawling bot fetches a URL:
    https://www.stuffthings.com/oldpath/
  2. and the server responds with a 301 Moved Permanently, with header:
    Location: https://www.stuffthings.com/redirected/
  3. and the bot then fetches the redirected URL:
    https://www.stuffthings.com/redirected/
  4. and this redirected page contains a link:
    <A HREF="images/bla.png">

then the bot must fetch the image at:
https://www.stuffthings.com/redirected/images/bla.png
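
The steps above can be sketched in a few lines. The function name resolve_after_redirect is made up for this illustration, and a real crawler would take the final URL from its HTTP client rather than compute it by hand.

```python
from urllib.parse import urljoin

def resolve_after_redirect(request_url, location, href):
    """Resolve a link found on a page that was reached through a redirect."""
    # The Location header may itself be relative; it is resolved against
    # the URL that was requested.
    final_url = urljoin(request_url, location)
    # Links on the page are resolved against the final URL, not the old one.
    return urljoin(final_url, href)

image_url = resolve_after_redirect(
    "https://www.stuffthings.com/oldpath/",
    "https://www.stuffthings.com/redirected/",
    "images/bla.png",
)
# image_url is now: "https://www.stuffthings.com/redirected/images/bla.png"

# What a broken bot requests when it keeps using the original URL as base:
broken_url = urljoin("https://www.stuffthings.com/oldpath/", "images/bla.png")
# broken_url is now: "https://www.stuffthings.com/oldpath/images/bla.png"
```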

Conclusion

It is not rocket science. This is how relative paths have worked since the dawn of time. If you do not understand these basic principles, then please do not write some crawler that is then let loose on the entire internet.

If you just use the standard path handling routines or libraries in your favourite programming language, you avoid the mistakes you would make when going the idiotic route of implementing your own path handling. For instance, in Python, one can rely on urllib.parse:

from urllib.parse import urljoin
source_uri = "https://www.somedomain.com/stuff/things.html"
link_uri = "../gizmos/doodads.html"
destination = urljoin(source_uri, link_uri)

# destination is now: "https://www.somedomain.com/gizmos/doodads.html"

Dr. Lex, 2025-06 - 2026-01
This work is licensed under a Creative Commons CC0 1.0 Universal License.