I am seeing an increasing number of web crawling bots whose implementation of relative path handling in hyperlinks is obviously broken. It seems the people who implemented those bots have no understanding of how relative paths are supposed to work, or they just suck at programming. Or maybe I am witnessing the first signs of A.I. breakdown due to LLMs having been trained on bad A.I. slop generated by other LLMs.
The worst example I have seen pretended to be GPTBot/1.2. One would expect at least a basic level of competence from the people who wrote that bot, especially because when asking ChatGPT itself about this, it will clearly explain the same principles as are listed below. I can only hope that all the crap in my logs bearing this bot's signature actually comes from an imposter that also happens to run from Microsoft cloud servers. However, I haven't seen this one lately; now it is mostly meta-externalagent requesting broken paths.
Here is a refresher on how relative URLs like ./ or ../somedir/somepage.html actually work.
Anyone writing software or scripts of any kind should know this.
- ./ or . means: the current directory, i.e. the path of the document in which the link occurs.
- ../ or .. means: the parent directory, one level up from the current directory.
- / means: the root of the current domain.
It is debatable whether / actually is a relative or an absolute URL; this may vary depending on the context. However, forget about these semantics; just make sure the path handling itself is correct.
For anyone writing software that performs requests on the internet, it should be common knowledge that:
If a request is answered with a redirect via a Location header, and the page at the new location contains relative paths, then those paths must be resolved relative to the path of the redirected page, not to the original URL that produced the redirect.
The same goes for files referred from within another file. Relative paths must always be resolved against the path of the file in which they occur.
For instance, a webpage links to a CSS file that contains paths to background images. If those paths are relative, they must be resolved relative to the URL of the CSS file, not of the page that links to the CSS file.
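To illustrate the CSS case, here is a minimal sketch using Python's urljoin. The URLs and file names are made up for the example; the point is that the background image resolves against the CSS file's URL, not the page's:

```python
from urllib.parse import urljoin

# Hypothetical page that links to a stylesheet via a relative path:
page_url = "https://www.stuffthings.com/pages/index.html"
css_url = urljoin(page_url, "../styles/main.css")
# css_url == "https://www.stuffthings.com/styles/main.css"

# A relative path inside that CSS file, e.g. url(img/bg.png),
# must resolve against the CSS file's URL, not the page's:
bg_url = urljoin(css_url, "img/bg.png")
# bg_url == "https://www.stuffthings.com/styles/img/bg.png"
```

Resolving the same img/bg.png against page_url instead would wrongly yield https://www.stuffthings.com/pages/img/bg.png.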
- If the current page is https://www.stuffthings.com/a/p.html, then a link to ./ must resolve to https://www.stuffthings.com/a/
- If the current path is https://www.stuffthings.com/a/, then ./ simply points to that same path, because it is a directory.
- If the current page is https://www.stuffthings.com/a/b/q.html, then a link to ../ must resolve to https://www.stuffthings.com/a/
- If the current page is https://www.stuffthings.com/a/b/q.html, then a link to / must resolve to https://www.stuffthings.com/
- If the current page is https://www.stuffthings.com/a/b/q.html, then a link to ../d/r.html must resolve to https://www.stuffthings.com/a/d/r.html because:
  - the b/ component is stripped off by ../, ending up in directory a/
  - then d/r.html is appended to that directory.
- A link to /root.html will always represent the location https://www.stuffthings.com/root.html, no matter how deep you are inside any subpath of the www.stuffthings.com domain.

If:
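All of the cases above can be checked directly with Python's urljoin, which implements standard relative reference resolution:

```python
from urllib.parse import urljoin

# Each resolution case from the list above, verified with urljoin:
assert urljoin("https://www.stuffthings.com/a/p.html", "./") == "https://www.stuffthings.com/a/"
assert urljoin("https://www.stuffthings.com/a/", "./") == "https://www.stuffthings.com/a/"
assert urljoin("https://www.stuffthings.com/a/b/q.html", "../") == "https://www.stuffthings.com/a/"
assert urljoin("https://www.stuffthings.com/a/b/q.html", "/") == "https://www.stuffthings.com/"
assert urljoin("https://www.stuffthings.com/a/b/q.html", "../d/r.html") == "https://www.stuffthings.com/a/d/r.html"
assert urljoin("https://www.stuffthings.com/a/b/q.html", "/root.html") == "https://www.stuffthings.com/root.html"
```

Every assertion passes; any crawler that resolves these differently is simply broken.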
- a bot requests https://www.stuffthings.com/oldpath/
- the server answers with 301 Moved Permanently, with header: Location: https://www.stuffthings.com/redirected/
- the page at https://www.stuffthings.com/redirected/ contains the link <A HREF="images/bla.png">
then the bot must fetch the image at:
https://www.stuffthings.com/redirected/images/bla.png
not at:
https://www.stuffthings.com/oldpath/images/bla.png
and certainly not at:
https://www.stuffthings.com/images/bla.png

It is not rocket science. This is how relative paths have worked since the dawn of time. If you do not understand these basic principles, then please do not write some crawler that is then let loose on the entire internet.
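The redirect scenario above boils down to one rule: after following a redirect, resolve links against the final URL (from the Location header), never against the originally requested URL. A minimal sketch with urljoin:

```python
from urllib.parse import urljoin

# The scenario above: the bot requested /oldpath/, but the server
# redirected it to /redirected/.
original_url = "https://www.stuffthings.com/oldpath/"
final_url = "https://www.stuffthings.com/redirected/"  # from the Location header

# Correct: resolve the link against the final, redirected URL.
image_url = urljoin(final_url, "images/bla.png")
# image_url == "https://www.stuffthings.com/redirected/images/bla.png"

# What the broken bots effectively do instead:
wrong = urljoin(original_url, "images/bla.png")
# wrong == "https://www.stuffthings.com/oldpath/images/bla.png"
```

Note that HTTP client libraries that follow redirects usually expose the final URL (e.g. response.url in urllib and requests); that is the URL a crawler must feed into urljoin.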
If you just use the standard path handling routines or libraries of your favourite programming language, you cannot make the mistakes you would make when going the idiotic route of implementing your own path handling. For instance, in Python one can rely on urllib:
from urllib.parse import urljoin

source_uri = "https://www.somedomain.com/stuff/things.html"
link_uri = "../gizmos/doodads.html"
destination = urljoin(source_uri, link_uri)
# destination is now: "https://www.somedomain.com/gizmos/doodads.html"