I am seeing an increasing number of web crawling bots whose implementation of relative path handling in hyperlinks is obviously broken. It seems the people who implemented those bots have no understanding of how relative paths are supposed to work, or they just suck at programming. Or maybe I am witnessing the first signs of A.I. breakdown due to LLMs having been trained on bad A.I. slop generated by other LLMs.
The worst example I have seen pretended to be GPTBot/1.2. One would expect at least a basic level of competence from the people who wrote that bot, especially because when asking ChatGPT itself about this, it will clearly explain the same principles as are listed below. I can only hope that all the crap in my logs bearing this bot's signature actually comes from an imposter that also happens to run from Microsoft cloud servers. However, I haven't seen this one lately; now it is mostly meta-externalagent requesting broken paths.
Here is a refresher on how relative URLs like ./ or ../somedir/somepage.html actually work.
Anyone writing software or scripts of any kind should know this.
- ./ or . means: the current directory, i.e. the path of the document in which the link occurs.
- ../ or .. means: the parent directory, one level up from the current directory.
- / means: the root of the current domain.
It is debatable whether / actually is a relative or an absolute URL; this may vary depending on the context. However, forget about these semantics; just make sure the path handling itself is correct.
For anyone writing software that performs requests on the internet, it should be common knowledge that:
If a request is answered with a redirect via a Location header, and the page at the new location contains relative paths, then those paths must be resolved relative to the path of the redirected page, not to the original URL that produced the redirect.
The same goes for files referred from within another file. Relative paths must always be resolved against the path of the file in which they occur.
For instance, a webpage links to a CSS file that contains paths to background images. If those paths are relative, they must be resolved relative to the URL of the CSS file, not of the page that links to the CSS file.
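To illustrate the CSS case, here is a minimal sketch using Python's urljoin. The URLs and file names are made up for the example; the point is that the background image resolves against the CSS file's URL, not the page's:

```python
from urllib.parse import urljoin

# Hypothetical page that links to a stylesheet via a relative path:
page_url = "https://www.stuffthings.com/pages/index.html"
css_url = urljoin(page_url, "../styles/main.css")
# css_url == "https://www.stuffthings.com/styles/main.css"

# A relative path inside that CSS file, e.g. url(img/bg.png),
# must resolve against the CSS file's URL, not the page's:
bg_url = urljoin(css_url, "img/bg.png")
# bg_url == "https://www.stuffthings.com/styles/img/bg.png"
```

Resolving the same img/bg.png against page_url instead would wrongly yield https://www.stuffthings.com/pages/img/bg.png.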
- If the current page is https://www.stuffthings.com/a/p.html, then a link to ./ must resolve to https://www.stuffthings.com/a/
- If the current path is https://www.stuffthings.com/a/, then ./ simply points to that same path, because it is a directory.
- If the current page is https://www.stuffthings.com/a/b/q.html, then a link to ../ must resolve to https://www.stuffthings.com/a/
- If the current page is https://www.stuffthings.com/a/b/q.html, then a link to / must resolve to https://www.stuffthings.com/
- If the current page is https://www.stuffthings.com/a/b/q.html, then a link to ../d/r.html must resolve to https://www.stuffthings.com/a/d/r.html because:
  - the b/ component is stripped off by ../, ending up in directory a/
  - then d/r.html is appended to that directory.
- A link to /root.html will always represent the location https://www.stuffthings.com/root.html, no matter how deep you are inside any subpath of the www.stuffthings.com domain.

If:
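All of the cases above can be checked directly with Python's urljoin, which implements standard relative reference resolution:

```python
from urllib.parse import urljoin

# Each resolution case from the list above, verified with urljoin:
assert urljoin("https://www.stuffthings.com/a/p.html", "./") == "https://www.stuffthings.com/a/"
assert urljoin("https://www.stuffthings.com/a/", "./") == "https://www.stuffthings.com/a/"
assert urljoin("https://www.stuffthings.com/a/b/q.html", "../") == "https://www.stuffthings.com/a/"
assert urljoin("https://www.stuffthings.com/a/b/q.html", "/") == "https://www.stuffthings.com/"
assert urljoin("https://www.stuffthings.com/a/b/q.html", "../d/r.html") == "https://www.stuffthings.com/a/d/r.html"
assert urljoin("https://www.stuffthings.com/a/b/q.html", "/root.html") == "https://www.stuffthings.com/root.html"
```

Every assertion passes; any crawler that resolves these differently is simply broken.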
- a bot requests https://www.stuffthings.com/oldpath/
- the server answers with 301 Moved Permanently, with header: Location: https://www.stuffthings.com/redirected/
- the page at https://www.stuffthings.com/redirected/ contains the link <A HREF="images/bla.png">
then the bot must fetch the image at:
https://www.stuffthings.com/redirected/images/bla.png
not at:
https://www.stuffthings.com/oldpath/images/bla.png
and certainly not at:
https://www.stuffthings.com/images/bla.png

It is not rocket science. This is how relative paths have worked since the dawn of time. If you do not understand these basic principles, then please do not write some crawler that is then let loose on the entire internet.
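The redirect scenario above boils down to one rule: after following a redirect, resolve links against the final URL (from the Location header), never against the originally requested URL. A minimal sketch with urljoin:

```python
from urllib.parse import urljoin

# The scenario above: the bot requested /oldpath/, but the server
# redirected it to /redirected/.
original_url = "https://www.stuffthings.com/oldpath/"
final_url = "https://www.stuffthings.com/redirected/"  # from the Location header

# Correct: resolve the link against the final, redirected URL.
image_url = urljoin(final_url, "images/bla.png")
# image_url == "https://www.stuffthings.com/redirected/images/bla.png"

# What the broken bots effectively do instead:
wrong = urljoin(original_url, "images/bla.png")
# wrong == "https://www.stuffthings.com/oldpath/images/bla.png"
```

Note that HTTP client libraries that follow redirects usually expose the final URL (e.g. response.url in urllib and requests); that is the URL a crawler must feed into urljoin.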
If you just use the standard path handling routines or libraries of your favourite programming language, you cannot make the mistakes you would make when going the idiotic route of implementing your own path handling. For instance, in Python one can rely on urllib:
from urllib.parse import urljoin

source_uri = "https://www.somedomain.com/stuff/things.html"
link_uri = "../gizmos/doodads.html"
destination = urljoin(source_uri, link_uri)
# destination is now: "https://www.somedomain.com/gizmos/doodads.html"