One of the great controversies surrounding artificial intelligence since its arrival is the use of private user data for training. The issue has now reached the Bluesky social network, although the company denies using its users' data this way.
First of all, many of you will already know that Bluesky has positioned itself as a clear alternative for those who have become frustrated with X. However, it is now facing significant controversy over the training of AI models on its users' posts.
Bluesky is built on the decentralized AT Protocol, which is meant to give users more control and transparency. But a recent incident has shown that being an open source, decentralized platform has its drawbacks.
A machine learning expert compiled a dataset of one million Bluesky posts using the social network's Firehose API. The dataset was not anonymized: it included user content along with decentralized identifiers (DIDs), so posts could be traced back to their authors. The stated goal was to support machine learning research and experimentation with social media data.
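To illustrate what minimal de-identification of such a dataset could look like, here is a short Python sketch that replaces each author DID with a keyed hash. The record layout (`author`, `text`), the sample DIDs, and the helper names are hypothetical and are not taken from the dataset described above. Note that this is pseudonymization only: the post text itself may still identify its author.

```python
import hashlib
import hmac


def pseudonymize_did(did: str, secret_key: bytes) -> str:
    """Map a DID to a stable pseudonym using HMAC-SHA256 with a secret key."""
    digest = hmac.new(secret_key, did.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"user_{digest[:16]}"


def pseudonymize_records(records, secret_key: bytes):
    """Return copies of the records with the author DID replaced by a pseudonym."""
    return [
        {**record, "author": pseudonymize_did(record["author"], secret_key)}
        for record in records
    ]


# Hypothetical sample records; the field names are illustrative assumptions.
sample = [
    {"author": "did:plc:exampleuser1", "text": "Hello"},
    {"author": "did:plc:exampleuser2", "text": "World"},
    {"author": "did:plc:exampleuser1", "text": "Again"},
]

cleaned = pseudonymize_records(sample, secret_key=b"keep-this-secret")
```

The same DID always maps to the same pseudonym, so per-author analysis remains possible, while anyone without the key cannot easily recompute the mapping from a list of known DIDs.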
Unsurprisingly, once all this information was made public on Bluesky, it did not sit well with most users. Many expressed their opposition to having their posts used for AI training, a stance that actually coincides with Bluesky's own policy.
Bluesky denies using user data in AI
What's more, those responsible for the platform explicitly state that they do not use user content to train generative AI models. Even so, the dataset became a major point of controversy and unleashed a wave of criticism. Users argued that their posts were being used without their consent, violating the principles on which Bluesky was founded.
Ultimately, the extracted data was removed from the platform. However, while Bluesky itself says it does not use user posts to train its AI, its public, open source architecture allows third parties to use that data freely, including for purposes that the platform and its users strongly oppose.
For example, Bluesky's Firehose API streams all public posts in real time, which was key to creating the dataset in question. Although the feature is designed to promote transparency and innovation, it also opens the door to possible misuse.
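To make concrete how little effort collecting posts from an open stream takes, here is a minimal Python sketch. It does not connect to Bluesky: `simulated_stream` is a hypothetical stand-in that yields simplified events, and the event fields (`did`, `collection`, `operation`, `record`) are illustrative assumptions rather than the exact Firehose wire format. Only the collection name `app.bsky.feed.post` follows the AT Protocol's naming for posts.

```python
from itertools import islice

POST_COLLECTION = "app.bsky.feed.post"


def simulated_stream():
    """Hypothetical stand-in for a live stream; yields simplified events."""
    yield {"did": "did:plc:alice", "collection": POST_COLLECTION,
           "operation": "create", "record": {"text": "First post"}}
    # Non-post events (such as likes) are interleaved with posts.
    yield {"did": "did:plc:bob", "collection": "app.bsky.feed.like",
           "operation": "create", "record": {}}
    yield {"did": "did:plc:bob", "collection": POST_COLLECTION,
           "operation": "create", "record": {"text": "Second post"}}
    yield {"did": "did:plc:carol", "collection": POST_COLLECTION,
           "operation": "create", "record": {"text": "Third post"}}


def collect_posts(stream, limit):
    """Keep newly created posts from the stream until `limit` are collected."""
    posts = (
        {"author": event["did"], "text": event["record"]["text"]}
        for event in stream
        if event.get("collection") == POST_COLLECTION
        and event.get("operation") == "create"
    )
    return list(islice(posts, limit))


dataset = collect_posts(simulated_stream(), limit=2)
```

In this sketch the author identifier is stored next to the text, which is exactly the property that made the real dataset traceable to individual accounts.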
The irony of all this is that many users abandoned platforms like X to prevent their content from being used for artificial intelligence training. Bluesky, with its decentralized model, seemed like the antidote to this. But now, users realize that decentralization does not protect them from third parties doing whatever they want with their public data.