photog.social is one of the many independent Mastodon servers you can use to participate in the fediverse.
A place for your photos and banter. Photog first is our motto. Please refer to the site rules before posting.


Server stats: 250 active users

#crawler


Since various #ki #ai #crawler have been treating the public resources of open source projects with such exceptional "respect", I have decided to lock them out. In the past we had crawls that our #monitoring flagged as #ddos.

Various AS (autonomous systems) now enjoy a permanent 429; a few who ruin it for everyone…
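As a rough illustration of that kind of block (not the actual setup described above), here is a minimal Python WSGI sketch that answers requests from a list of blocked networks with HTTP 429; the CIDR ranges are documentation placeholders, not the real ASes in question.

```python
# Minimal sketch: answer requests from blocked networks with a permanent 429.
# The CIDR ranges below are placeholders, not the ASes mentioned in the post.
import ipaddress
from wsgiref.simple_server import make_server

BLOCKED_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),     # TEST-NET-1, stand-in for a blocked AS
    ipaddress.ip_network("198.51.100.0/24"),  # TEST-NET-2
]

def is_blocked(addr: str) -> bool:
    ip = ipaddress.ip_address(addr)
    return any(ip in net for net in BLOCKED_NETWORKS)

def app(environ, start_response):
    client = environ.get("REMOTE_ADDR", "")
    if client and is_blocked(client):
        start_response("429 Too Many Requests", [("Content-Type", "text/plain")])
        return [b"Too Many Requests\n"]
    start_response("200 OK", [("Content-Type", "text/plain")])
    return [b"Hello, fediverse!\n"]

if __name__ == "__main__":
    make_server("", 8000, app).serve_forever()
```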

🔗 RE: "Please stop externalizing your costs directly into my face"

AI training is controversial at best. If you say AI is trained fairly, you're either very blind to the reality of things or very naive, or both. None of the big AI tools are trained ethically, and this example from SourceHut just shows it.

👉 kevingimbel.de/link-blog/re-ht


Artificial intelligence companies are creating incredibly large-scale denial-of-service situations on the infrastructure of open source networks.

Now network owners need to waste time finding ways to send all these requests from the rogue AI insects to /dev/null.

@altbot

#DDoS #DenialOfService #AI #LLM #KDE #crawler #programming #Alibaba #IP #FOSS #attack #OpenSource

thelibre.news/foss-infrastruct

It looks like LLM-producing companies that are massively #crawling the #web require website owners to take action to opt out. While I am not intrinsically against #generativeai and the acquisition of #opendata, reading about hundreds of dollars of rising #cloud costs for hobby projects is quite concerning. How is it acceptable that hypergiants skyrocket the costs of tightly budgeted projects through massive spikes in egress traffic and increased processing requirements? Projects that run on a shoestring budget and are operated by volunteers who dedicate hundreds of hours without any reward other than believing in their mission?

I am mostly concerned about opt-out being the default. Are the owners of those projects really required to take action? Seriously? As an #operator, is it my responsibility to methodically work my way through the crawling documentation of hundreds of #LLM #web #crawlers? Am I the one responsible for configuring a unique crawling specification in my robots.txt because hypergiants make it immensely hard to have generic #opt-out configurations that target LLM projects specifically?

I refuse to accept that this is our new norm: a norm in which hypergiants not only methodically exploit the work of thousands of individuals for their own benefit without returning a penny, but also one in which the resource owner is expected to keep these crawlers from skyrocketing their own operational costs.

We need a new default: #opt-in. Often, public and open projects are keen to share their data. They just don't like the idea of carrying, without any notice, the unpredictable and manifold financial burden that sharing it with said crawlers entails. Even #CommonCrawl has fail-safe mechanisms to reduce the burden on website owners. Why are LLM crawlers above the guidelines of good #Internet citizenship?

To counter the most common argument up front: yes, you can deny by default in your robots.txt, but that also shuts out every non-mainstream crawler, such as small search engines.
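To make that trade-off concrete, here is a small, purely illustrative check using Python's urllib.robotparser showing how a deny-by-default robots.txt with a short allowlist treats different crawlers; the user agents and rules are examples, not a recommended configuration.

```python
# Sketch: how a deny-by-default robots.txt with a small allowlist behaves.
# The user agents and rules are illustrative, not an endorsed configuration.
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

for agent in ("Googlebot", "GPTBot", "SomeSmallSearchEngine"):
    allowed = parser.can_fetch(agent, "https://example.org/some/page")
    print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

Everything that is not explicitly allowlisted, including small, well-behaved search crawlers, ends up blocked, which is exactly the problem described above.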

Some concerning #news articles on the topic:

Continued thread

The downside of the "hard" #Paywall above: search engines cannot find the article either. So online media started treating them differently decades ago. One option is to specifically detect the #Crawler of #Google & Co. and serve them the full text anyway.

Another option is the "soft" paywall: the web server delivers the full text to every requester, but the browser is instructed via JavaScript to hide it.
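As a sketch of the first approach (serving verified search crawlers the full text), sites typically combine a User-Agent check with a reverse-plus-forward DNS lookup, since genuine Googlebot addresses resolve to googlebot.com or google.com; the code below is illustrative and not how any particular newspaper implements it.

```python
# Sketch: decide whether a request plausibly comes from Googlebot, so the full
# article could be served instead of the paywalled teaser. Illustrative only.
import socket

def looks_like_googlebot(user_agent: str, client_ip: str) -> bool:
    if "Googlebot" not in user_agent:
        return False
    try:
        # Reverse lookup, then forward-confirm the name to guard against spoofing.
        host, _, _ = socket.gethostbyaddr(client_ip)
        if not host.endswith((".googlebot.com", ".google.com")):
            return False
        return client_ip in socket.gethostbyname_ex(host)[2]
    except OSError:
        return False

def render_article(full_text: str, teaser: str, user_agent: str, client_ip: str) -> str:
    # Hard paywall with a crawler exception: verified bots get everything,
    # everyone else gets the teaser.
    return full_text if looks_like_googlebot(user_agent, client_ip) else teaser
```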

Here: the #NZZ

Here is how a problem we ran into on our server Poliverso.org made us notice the invasion of crawlers that Meta is unleashing across the Web with the goal of training its artificial intelligence. And the Italian media? Silent!

https://www.informapirata.it/2024/10/10/metastasiverso-vs-poliverso-ecco-come-linformazione-italiana-ha-ignorato-il-piu-grande-sversamento-di-rifiuti-compiuto-da-facebook-nel-web/

Interesting...

I just got a follow request from #Awakari, with its own detailed and formatted message, and it never occurred to me that that's the sort of thing that #ActivityPub supports, though it does make sense.

I'm also a little surprised, because my understanding has been that the main reason they are controversial is that they don't get permission to crawl people's profiles, but here they've sent a customized agreement-type thing for me to look at...

Did you know? The Facebook crawler ignores most HTTP return codes; the one it actually respects is 403!

If a page returns 403, the Meta crawler won't access it. Other status codes, like 404 or 451, won't stop it from attempting to scrape your page's content. The crawler will keep going and won't apply any delay.
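If you want to keep that crawler out, this behaviour suggests answering it with 403 specifically. A minimal sketch under assumptions follows; the user-agent substrings (facebookexternalhit, meta-externalagent) are the ones commonly reported for Meta's crawlers and would need verifying against your own logs.

```python
# Sketch: choose the status code for a request, answering Meta's crawlers with
# 403 since other codes reportedly do not stop them. The user-agent markers are
# assumptions to verify against your own logs.
META_CRAWLER_MARKERS = ("facebookexternalhit", "meta-externalagent")

def status_for_request(user_agent: str) -> int:
    ua = user_agent.lower()
    if any(marker in ua for marker in META_CRAWLER_MARKERS):
        return 403  # the only code the Meta crawler is said to respect
    return 200

# Example:
print(status_for_request("facebookexternalhit/1.1"))  # 403
print(status_for_request("Mozilla/5.0"))              # 200
```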