I am seeing an increasing number of web crawling bots whose handling of relative paths in hyperlinks is obviously broken. It seems that the people who implemented those bots either have no understanding of how relative paths are supposed to work, or just suck at programming.
The worst example is GPTBot/1.2. One would expect at least a basic level of competence from the people who wrote that bot, especially because ChatGPT itself, when asked about this, will clearly explain the same principles as are listed below. I can only hope that all the crap in my logs that bears this bot's signature actually comes from an impostor that also happens to run from Microsoft cloud servers, but somehow I doubt this is the case.
Here is a refresher on how relative URLs like ./ or ../somedir/somepage.html actually work.
If a request is redirected to a new path through a Location header, then relative paths in the resulting page must be resolved relative to the path of the redirected page, not to the original URL that produced the redirect.
The same goes for files referenced from within another file. For instance, a webpage links to a CSS file that contains paths to background images. If those paths are relative, they must be resolved relative to the URL of the CSS file, not of the page that links to the CSS file.
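The CSS case can be sketched in a few lines of Python with urljoin from the standard library. The page URL, the stylesheet path and the image path below are made-up values for illustration only; the point is merely that the image reference is joined to the URL of the CSS file, not to the URL of the page.

```python
from urllib.parse import urljoin

# Hypothetical values, chosen only to illustrate the rule.
page_url = "https://www.stuffthings.com/articles/page.html"
css_href = "../style/main.css"      # as written in the page's <link> tag
image_ref = "img/background.png"    # as written inside the CSS: url(img/background.png)

# Step 1: the stylesheet reference is relative to the page.
css_url = urljoin(page_url, css_href)

# Step 2: the image reference is relative to the stylesheet, NOT to the page.
image_url = urljoin(css_url, image_ref)

# What a broken crawler would request if it used the page as the base.
wrong_url = urljoin(page_url, image_ref)

print(css_url)    # https://www.stuffthings.com/style/main.css
print(image_url)  # https://www.stuffthings.com/style/img/background.png
print(wrong_url)  # https://www.stuffthings.com/articles/img/background.png
```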
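./ or . means: the current directory, that is, the path of the current URL up to and including its last slash.
../ or .. means: the parent directory, one level above the current directory.
/ means: the root of the site, directly after the domain name.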
means:https://www.stuffthings.com/oldpath/
301 Moved Permanently
, with header:Location: https://www.stuffthings.com/newpath/
https://www.stuffthings.com/newpath/
<A HREF="images/bla.png">
then the bot must fetch the image at:
https://www.stuffthings.com/newpath/images/bla.png
https://www.stuffthings.com/oldpath/images/bla.png
https://www.stuffthings.com/images/bla.png
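A minimal sketch of the redirect case in Python, assuming the crawler has the originally requested URL, the value of the Location header, and the link found in the final page. The helper name resolve_after_redirect is hypothetical and exists only for this illustration. Because a Location value may itself be relative, it is first resolved against the requested URL.

```python
from urllib.parse import urljoin

def resolve_after_redirect(requested_url, location_header, href):
    """Resolve href against the URL the redirect points to (hypothetical helper)."""
    # The redirect target becomes the new base; a relative Location value
    # is itself resolved against the URL that was originally requested.
    final_url = urljoin(requested_url, location_header)
    # Links in the fetched page are relative to the final URL,
    # never to the URL that produced the redirect.
    return urljoin(final_url, href)

image_url = resolve_after_redirect(
    "https://www.stuffthings.com/oldpath/",
    "https://www.stuffthings.com/newpath/",
    "images/bla.png",
)
print(image_url)  # https://www.stuffthings.com/newpath/images/bla.png
```

When fetching with urllib.request, the response object's geturl() method reports the URL that was finally retrieved after any redirects, which is the value to use as the base.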
If the current page is https://www.stuffthings.com/a/p.html, then a link to ./ must resolve to https://www.stuffthings.com/a/
If the current page is https://www.stuffthings.com/a/b/q.html, then a link to ../ must resolve to https://www.stuffthings.com/a/
If the current page is https://www.stuffthings.com/a/b/q.html, then a link to / must resolve to https://www.stuffthings.com/
If the current page is https://www.stuffthings.com/a/b/q.html, then a link to ../d/r.html must resolve to https://www.stuffthings.com/a/d/r.html
A link to /root.html will become https://www.stuffthings.com/root.html, no matter how deep you are inside any subpath of the www.stuffthings.com domain.
It is not rocket science. This is how relative paths have worked since the dawn of time. If you do not understand these basic principles, then please do not write some crawler that is then let loose on the entire internet.
If you just use the standard path handling routines or libraries in your favourite programming language, you avoid the mistakes you would make when going the idiotic route of implementing your own path handling. For instance, in Python one can rely on urllib:
    from urllib.parse import urljoin

    source_uri = "https://www.somedomain.com/stuff/things.html"
    link_uri = "../gizmos/doodads.html"
    destination = urljoin(source_uri, link_uri)
    # destination is now: "https://www.somedomain.com/gizmos/doodads.html"
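The same call reproduces every resolution example listed earlier, which makes for a cheap self-test of any crawler. The sketch below simply puts those examples in a table and checks them. The base page used for the /root.html case is an arbitrary choice, since that result does not depend on the depth of the current page.

```python
from urllib.parse import urljoin

# (current page, link as written, expected result)
examples = [
    ("https://www.stuffthings.com/a/p.html", "./", "https://www.stuffthings.com/a/"),
    ("https://www.stuffthings.com/a/b/q.html", "../", "https://www.stuffthings.com/a/"),
    ("https://www.stuffthings.com/a/b/q.html", "/", "https://www.stuffthings.com/"),
    ("https://www.stuffthings.com/a/b/q.html", "../d/r.html", "https://www.stuffthings.com/a/d/r.html"),
    # Base page chosen arbitrarily: a root-relative link ignores the current path.
    ("https://www.stuffthings.com/a/b/q.html", "/root.html", "https://www.stuffthings.com/root.html"),
]

results = []
for base, link, expected in examples:
    resolved = urljoin(base, link)
    results.append(resolved)
    # Fails loudly if the library disagrees with the rules described above.
    assert resolved == expected, (base, link, resolved)
    print(f"{link!r} from {base} -> {resolved}")
```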
Dr. Lex, 2025-06