Monday, June 10, 2024

AI Scraping My Blog?

My StatCounter account has been showing this frequent Hong Kong visitor:


Total Sessions usually records how many times a computer has visited, but it says only 1, even though this visitor accounts for five of the 20 hits on this one page of the StatCounter report.  It's been showing up frequently for weeks now.
I know, I said five, but they are scattered.  The one on top is one.  Here are three more, and there was one more after that.


I've had this sort of thing before, but it's been a while.  In the past, the assumption was they were scraping content.  Now I'm wondering if it's an AI bot gathering material for training.  If so, what should I do, and how?  Here's what Duda says:

"How to Block AI Crawlers from Crawling your Site

Some site owners are choosing to block AI crawlers, such as ChatGPT and Bard from crawling their site in order to prevent it from learning from or using their website content. You can block these AI user-agents in a similar manner as you would block Google crawlers; by replacing the default robots.txt file with a new file that specifies disallow rules for specific AI user-agents."
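
For anyone curious what those "disallow rules" look like, here is a minimal sketch in Python that prints the text of such a file.  The user-agent names (GPTBot, Google-Extended, CCBot) are my assumption about what some AI crawlers call themselves, and the function name is made up for illustration; check each company's own documentation before relying on any of it.

```python
# Assumed AI crawler user-agent tokens. These names can change over time,
# so treat this list as illustrative rather than complete or current.
AI_USER_AGENTS = ["GPTBot", "Google-Extended", "CCBot"]


def build_robots_txt(blocked_agents):
    """Return robots.txt text that blocks the listed agents and allows all others."""
    groups = []
    for agent in blocked_agents:
        # "Disallow: /" asks this one crawler to stay away from every path.
        groups.append(f"User-agent: {agent}\nDisallow: /")
    # An empty Disallow line leaves the site open to every other crawler,
    # including ordinary search engines.
    groups.append("User-agent: *\nDisallow:")
    return "\n\n".join(groups) + "\n"


robots_txt = build_robots_txt(AI_USER_AGENTS)
print(robots_txt)
```

The printed text is what would go into the robots.txt file.  It is only a request; a crawler that ignores robots.txt will not be stopped by it.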

When I first started blogging, I spent a lot of time learning about (and blogging about) technical aspects of blogging: how to find out if anyone is reading the blog, how to embed photos and videos, how to change the format, how to add an email address, etc.

Now AI is raising other issues, such as how to block AI crawlers from using your site to train their bots.

This is not what I want to spend my time on.  First, the internet is telling me I have to block each crawler separately by adding code to the robots.txt file.


Should You Block AI Tools From Accessing Your Website?

Unfortunately, there’s no simple way to block all AI bots from accessing your website, and manually blocking each individual bot is almost impossible. Even if you keep up with the latest AI bots roaming the web, there’s no guarantee they’ll all adhere to the commands in your robots.txt file. 

From Google Search Central:

"You can control which files crawlers may access on your site with a robots.txt file.

A robots.txt file lives at the root of your site. So, for site www.example.com, the robots.txt file lives at www.example.com/robots.txt. robots.txt is a plain text file that follows the Robots Exclusion Standard. A robots.txt file consists of one or more rules. Each rule blocks or allows access for all or a specific crawler to a specified file path on the domain or subdomain where the robots.txt file is hosted. Unless you specify otherwise in your robots.txt file, all files are implicitly allowed for crawling."
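
To see how a well-behaved crawler would read such rules, here is a small sketch using Python's standard-library robots.txt parser.  The sample rules, the GPTBot name, and the example.com address are assumptions for illustration, not anything taken from this blog's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: block one AI crawler, leave everything else open.
SAMPLE_RULES = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow:
"""


def is_allowed(rules_text, user_agent, url):
    """Return True if the rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(rules_text.splitlines())
    return parser.can_fetch(user_agent, url)


page = "https://www.example.com/2024/06/post.html"
gptbot_ok = is_allowed(SAMPLE_RULES, "GPTBot", page)
googlebot_ok = is_allowed(SAMPLE_RULES, "Googlebot", page)
print(gptbot_ok, googlebot_ok)
```

With these sample rules the AI crawler is refused and the search engine crawler is allowed, which is the outcome I would want: blocked from training, still findable in search.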


That means I have to find the robots.txt file, add stuff, and hope I do it just right so I don't screw something else up.  But this site also warns:

"If you use a site hosting service, such as Wix or Blogger [That's me], you might not need to (or be able to) edit your robots.txt file directly. Instead, your provider might expose a search settings page or some other mechanism to tell search engines whether or not to crawl your page."

Of course I don't want to block search engines, or only subscribers will ever see my posts.

So I'm asking myself: is this worth the time it's going to take to figure out?  Well, someone else asked that too.

"The real question here is whether the results are worth the effort, and the short answer is (almost certainly) no."

Here's another one saying the same thing:

"At the end of the day blocking ChatGPT and other generative AI crawlers is really a matter of choice. Depending on your website’s purpose and/or your business model it may make sense to. But in my opinion the vast majority of sites have nothing to fear from allowing AI crawlers to crawl their site."

For now, I want to agree with this advice.  But then I start thinking that this was written by an AI firm that wants to steal your content.   

And I don't even know if that Hong Kong visitor is scraping material for some AI enterprise.  Maybe it's just stealing content.  

Like your car, your house, your garden, your teeth, everything needs some maintenance to keep it functioning.  Clearly my phone and computer do, and this blog does as well, though I've avoided that for some time on the blog.  

I'm now officially putting myself on notice to pay more attention to AI.  

