fair enough, the did:web flows are not documented even for technical atproto developers, and there needs to be a self-serve way to heal identity/account problems elsewhere in the network (the "burn" problem).
I do think that did:plc provides more pragmatic freedom and control than did:web for most folks, though the calculus might be different for institutions or individuals with a long-term commitment to running their own network services. But did:web should be a functional alternative on principle.
I'm glad that the PDS was easy to get up and running, and that the author was able to find a supportive community on discord.
Thanks for responding, Brian. While I don't agree with a lot of decisions Bluesky and the broader ATProto community have made, I am very excited that progress towards real decentralization is happening; Blacksky's app view, for instance, was the trigger for me to finally try to set up an account. I would love to see more of a focus on the parts of the system that make this difficult, so that I and other people who are tired of coupling ourselves to centralized systems can participate. It's hard for me to trust that this is the direction the community is interested in moving, but I hope you prove me wrong.
Because of your blog post I went through the process of setting up a did:web account myself this afternoon, and it was painful. Eg, I found a bug in our Go SDK causing that "deactivated" error (https://github.com/bluesky-social/indigo/pull/1281). I kept notes and will try to get out a blog post and update to 'goat' soon.
We've also been making progress on the architecture and governance of the PLC system. I don't know if those will assuage all concerns with that system immediately, but I do think they are meaningful steps in reducing operational dependency on Bluesky PBC.
You can host your own instance, but resolving forks is not self-authenticating and requires some central trust (because of the 72-hour rollback window for higher-priority rotation keys). Not counting that, you could essentially run your own fully independent instance where the worst that could happen is that you lack some newer updates to people's DID documents (but anyone can upload them since they're self-authenticating). Some people do run their own instances for caching reasons, but these just ingest operations from the official one.
In terms of "credible exit", if the community at large could decide to move to a different PLC host, it would be technically possible for everyone to switch over.
Worth mentioning that Bluesky PBC is relinquishing legal control over the PLC and spinning it off into its own entity based in Switzerland.[1]
While did:plc was intended to be centralised from the start and under open governance (https://docs.bsky.app/blog/plc-directory-org), the DID spec provides a framework for adopting other key-resolution methods.
As part of the IETF work (https://docs.bsky.app/blog/taking-at-to-ietf) this is a hotly debated area and I’d expect some solid evolution to happen as part of that process, super encourage anyone interested to get involved there!
I wrote a Bluesky app in preparation for a client project. ATProto is over-engineered for my purposes, though probably justifiably carefully engineered for the purposes of a big social Twitter-like thing. But since I didn't have to do the engineering, so what? It's a very solid platform for many kinds of multi-user information-sharing systems.
This article does give me the impression that I should make and use more test accounts than I currently do when mucking around with ATProto/Bluesky.
DNS poisoning is a concern in some situations, but not always.
The common case with at:// strings is to put in the DID in the "authority" slot, not the handle. In that case, whether resolving a did:plc or did:web, there are regular HTTPS hostnames involved, and web PKI. So the DNS poisoning attacks are the same as most web browsing.
If you start from a handle, you do use DNS. If the resolution is HTTPS well-known, then web PKI is there, but TXT records are supported and relatively common.
What we currently recommend is for folks to resolve handles server-side, not on end devices. Or if they do need to resolve locally, use DoH to a trusted provider. This isn't perfect (server hosting environments could still be vulnerable to poisoning), but cuts the attack surface down a lot.
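For context, handle resolution checks two locations: a DNS TXT record at `_atproto.<handle>` whose value has the form `did=<did>`, and an HTTPS fetch of `https://<handle>/.well-known/atproto-did`. A sketch of the name-building and parsing side (no network calls here; a real resolver would perform the DNS/HTTPS lookups, ideally server-side or over DoH as recommended above):

```go
package main

import (
	"fmt"
	"strings"
)

// txtRecordName returns the DNS name whose TXT record can carry the DID.
func txtRecordName(handle string) string {
	return "_atproto." + handle
}

// parseTXTValue extracts the DID from a TXT record value of the form
// "did=did:plc:..." (or "did=did:web:..."), rejecting anything else.
func parseTXTValue(value string) (string, bool) {
	did, ok := strings.CutPrefix(value, "did=")
	if !ok || !strings.HasPrefix(did, "did:") {
		return "", false
	}
	return did, true
}

// wellKnownURL returns the HTTPS location of the same information.
func wellKnownURL(handle string) string {
	return "https://" + handle + "/.well-known/atproto-did"
}

func main() {
	fmt.Println(txtRecordName("example.com"))
	if did, ok := parseTXTValue("did=did:plc:abc123"); ok {
		fmt.Println(did)
	}
	fmt.Println(wellKnownURL("example.com"))
}
```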
DNSSEC is the current solution to this problem. But we feel like mandating it would add a lot of friction. We have also had pretty high-profile incidents in our production network caused by third parties' DNSSEC problems. For example, many active US Senators use NAME.senate.gov as a form of identity verification. The senate.gov DNS server uses DNSSEC, and this mostly all worked fine, until their DNSSEC configuration broke, which resulted in dozens of senators showing up in the Bluesky app as "invalid handle". This was a significant enough loss of trust in the DNSSEC ecosystem that we haven't pushed on it since. I think if we saw another application-layer protocol require it, and get successful adoption, we'd definitely reconsider.
Thanks, but it didn’t work very well for me. It’s all politics. I suspect some of the people I follow have bad taste, so this is the wrong way to find stuff I’m interested in.
If you want a service which indexes every post in the public network, including from folks you don't follow, that is just going to require resources. I think $200/month for a full-network index (as zeppelin does) is very reasonable and approachable for organized groups without external funding. Many Mastodon instances cost more than that, and provide a much smaller scope of indexing.
If you want a scaled-down setup for just a small community, which still interoperates with the full network but doesn't keep a complete copy of it, there are setups like AppViewLite, which can run on, e.g., an old laptop at home: https://github.com/alnkesq/AppViewLite
Personally, I don't think individualist self-hosting is a necessary or helpful goal for indexing the network. Most humans are not interested in spending the time or learning the skills to do this, even if it was as easy as setting up a self-hosted blog with RSS. I think small collectives (orgs, coops, communities, neighborhoods, companies, etc) exist and can fill this role.
Regardless, this is moving the discussion, which was about whether it was possible to decentralize each component of the network, not whether it was pragmatic for individuals to self-host the whole thing.
> I think $200/month for a full-network index (as zeppelin does) is very reasonable and approachable for organized groups without external funding.
I didn't know about these recent attempts, they're impressive for sure. However they write[1] about zeppelin:
"The cost to run this is about US $200/mo, primarily due to the 16 terabytes of storage it currently uses"
So when you give that $200/mo cost as a price point for "organized groups", are you forecasting that the cost of storage will go down as fast as the Bluesky data size grows? At what rate is the data size growing right now? The last numbers I saw were something like 2TB, so if it's already 16TB, it sounds like $200/mo is not going to be enough for very long.
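To put a shape on that question, here is a compounding back-of-envelope. Every parameter is an assumption for illustration (the growth rate and $/TB-month price are made up), apart from the 16 TB starting point quoted above:

```go
package main

import "fmt"

// costAfter compounds the index size monthly and prices it per TB-month.
// All inputs are hypothetical; this only sketches "how fast does $200/mo
// stop being enough?" for assumed growth and storage prices.
func costAfter(tb, monthlyGrowth, dollarsPerTBMonth float64, months int) float64 {
	for i := 0; i < months; i++ {
		tb *= 1 + monthlyGrowth
	}
	return tb * dollarsPerTBMonth
}

func main() {
	// 16 TB today, with an assumed 5%/month growth and $10/TB-month:
	fmt.Printf("$%.0f/mo after a year\n", costAfter(16, 0.05, 10, 12))
}
```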
hundreds (thousands?) of users have signed up for Bluesky Social, then moved their accounts to independent hosts. folks can use https://zeppelin.social/ as a totally free-standing bluesky posting experience that interoperates with the full network.
Bluesky Social still clearly dominates the ecosystem, but there is no single component of the system that does not have an open/alternative option for exit.
Do you disagree? Is there a specific centralized component you take issue with?
> there is no single component of the system that does not have an open/alternative option for exit.
Users can move their follows, followers and posts to zeppelin.social from Bluesky transparently?
Now you can of course debate what "decentralized" means, but in a social network easy migration between servers is the crucial feature that would allow the decentralized network to emerge.
Edit. Does the network actually work over at zeppelin.social alone if Bluesky servers go down?
Yes, all of those social graph relationships are hinged off a permanent identifier (DID) and everything comes along when accounts migrate PDS instances. Folks can use zeppelin.social from any PDS instance. The DID PLC directory is currently hosted by Bluesky, but the directory can be forked, and did:web identifiers can be used as an alternative (and several independence-minded folks in the network do so).
Migration between servers is so seamless that it causes confusion and doubt that the protocol even supports migration, because there is basically zero in-app visibility of which users are on which server.
Yes, the network continues to work on zeppelin.social if Bluesky servers are down.
> Now you can of course debate what "decentralized" means, but in a social network easy migration between servers is the crucial feature that would allow the decentralized network to emerge.
I totally agree. However, a lot of people in the fediverse/ActivityPub world apparently (?) disagree, seeing as your domain is tightly coupled to your server, i.e. no name portability. Seems like a wild oversight to me, and getting massive instances like matrix.org and mastodon.social seems like an inevitable consequence.
Lack of name portability implies greater risk when choosing a server. Greater risk when choosing a server means choosing comparatively less risky servers. Choosing comparatively less risky servers means choosing more well-known servers. Thus you have the GMail-ification of the fediverse.
Even if we set aside the details of dependence on Bluesky infrastructure, the effort to host “all components” is quite expensive and technology-intensive, with significant cost for storage and compute. For example, a deployment of “all the things” (just for you) is in the ballpark of 70-100€/month because of the way things are designed to work. And that’s not even factoring in the burden of managing the whole range of technologies involved.
Making something hard to set up or run, or complex to understand or change, is also a form of discouraging independent use.
zeppelin.social just gives me a black page with a stylized yellow scarab on it on desktop Safari, Mac Firefox, and Mac Chrome, with or without adblock.
I agree with the general sentiment here, but don't like the examples. 200 photos per person per year isn't very much! That is all fine.
What really bloats things out is surveillance (video and online behavioral) and logging/tracking/tracing data. Some of this ends up cold, but a lot of it is also warm, for analytics. It bloats CPU/RAM/network, which is pretty resource intensive.
The cost is justified because the margins of big tech companies are so wildly large. I'd argue those profits are mostly because of network effects and rentier behavior, not the actual value in the data being stored. If there was more competition pressure, these systems could be orders of magnitude more efficient without any significant difference in value/quality/outcome, or really even productivity.
SeaweedFS does the thing: I've used it to store billions of medium-sized XML documents, image thumbnails, PDF files, etc. It fills the gap between "databases" (broadly defined; maybe you can do few-tens-KByte docs but stretching things) and "filesystems" (hard/inefficient in reality to push beyond tens/hundreds of millions of objects; yes I know it is possible with tuning, etc, but SeaweedFS is better-suited).
The docs and operational tooling feel a bit janky at first, but they get the job done, and the whole project is surprisingly feature-rich. I've dealt with basic power-outages, hardware-caused data corruption (cheap old SSDs), etc, and it was possible to recover.
In some ways I feel like the surprising thing is that there is such a gap in open source S3 API blob stores. Minio is very simple and great, but is one-file-per-object on disk (great for maybe 90% of use-cases, but not billions of thumbnails). Ceph et al are quite complex. There are a bunch of almost-sort-kinda solutions like base64-encoded bytes in HBase/postgresql/etc, or chunking (like MongoDB), but really you just want to concatenate the bytes like a .tar file, and index in with range requests.
The Wayback Machine's WARC files plus CDX (index files with offset/range) is pretty close.
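The "concatenate the bytes like a .tar file and index in with range requests" idea is simple enough to sketch in a few lines. This toy keeps the log in memory, where a real store (WARC+CDX, or Haystack-style volumes) would append to files and serve reads as byte-range requests:

```go
package main

import (
	"errors"
	"fmt"
)

// BlobLog is a toy append-only blob store: blobs are concatenated into one
// byte stream, and an index maps keys to (offset, length) pairs, in the
// same spirit as WARC files plus a CDX index.
type BlobLog struct {
	data  []byte
	index map[string][2]int // key -> {offset, length}
}

func NewBlobLog() *BlobLog {
	return &BlobLog{index: make(map[string][2]int)}
}

// Put appends the blob and records where it landed.
func (l *BlobLog) Put(key string, blob []byte) {
	l.index[key] = [2]int{len(l.data), len(blob)}
	l.data = append(l.data, blob...)
}

// Get is the in-memory equivalent of an HTTP range request into the log.
func (l *BlobLog) Get(key string) ([]byte, error) {
	loc, ok := l.index[key]
	if !ok {
		return nil, errors.New("not found")
	}
	return l.data[loc[0] : loc[0]+loc[1]], nil
}

func main() {
	log := NewBlobLog()
	log.Put("a", []byte("hello"))
	log.Put("b", []byte("world"))
	b, _ := log.Get("b")
	fmt.Println(string(b)) // prints "world"
}
```

The real systems add the parts this sketch skips: durability, compaction of deleted blobs, and a persistent index.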
A lot of people regard GCs as something one should not use for low-level components like file systems and databases. So the fact that this performs so well might be the surprise for GP.
Which is annoying, as there are various GC systems that are near, or even equal to, performance of comparable non-GC systems.
(I personally blame Java for most of this)
Yes and no. While for most applications the GC is hardly an issue and is fast enough, the problem is applications where you need to be able to control exactly when and how memory/objects will be freed. These will never do well with any form of GC. But a looot of software can perform perfectly fine with a GC. If anything, it is mostly Go error handling that is the bigger issue...
You can often tell a system is written in Go when it locks up with no feedback. Go gives the illusion that concurrency is easy, but it simply makes it easy to write fragile concurrent systems.
A common pattern is that one component crashes because of a bug or a misconfiguration, then the controlling component locks up because it can't control the crashed component, and then all the other components lock up because they can't communicate with the locked up controller.
Anyway, that's my experience with several Go systems. Of course it's more a programming issue than a deficiency in Go itself. Though I think the way errors are return values that are easily ignored and frustrating to deal with encourages this sort of lax behavior.
For undisciplined devs (…) it can easily eat errors. Linters catch some of that, and of course you can also swallow errors in exception-based languages, but there you have to explicitly write catch {}, which is a code smell, while in Go it's easier to just ‘forget’ an error check. I actually like the Go way, just not that it’s easy to forget handling; that’s why I prefer the Haskell/Idris approach of returning errors (a monad) like Go does, but making it impossible to use the result without explicitly testing for errors.
I was quite surprised to discover that minio is one file per object. Having read some papers about object stores, this is definitely not what I expected.
For many small objects a generic filesystem can be less efficient than a more specialised store. Things are being managed that aren't needed for your blob store, block alignment can waste a lot of space, there are often inefficiencies in directories with many files, leading to hierarchical splitting that adds more inefficiency through indirection, etc. The space waste is mitigated somewhat in some filesystems by support for partial blocks, or by including small files directly in the directory entry or another structure (the MFT in NTFS), but this adds extra complexity.
The significance of these inefficiencies will vary depending on your base filesystem. The advantage of using your own storage format rather than naively using a filesystem is you can design around these issues taking different choices around the trade-offs than a general filesystem might, to produce something that is both more space efficient and more efficient to query and update for typical blob access patterns.
Using a database rather than a filesystem is a middle ground: still less efficient than a specially designed storage structure, but perhaps more so than a filesystem. Databases tend to have issues (not just inefficiencies) with large objects though, so your blob storage mechanism needs to work around those or just put up with them. A file-per-object store may have a database anyway, for indexing purposes.
A huge advantage of one file per object is simplicity of implementation. Also for some end users the result (a bunch of files rather than one large object) might better fit into their existing backup strategies¹. For many data and load patterns, the disadvantages listed above may hardly matter so the file-per-object approach can be an appropriate choice.
--
[1] Assuming they are not relying on the distributed nature of the blob store², which is naive³ as it doesn't protect you against some things a backup does, unless the blob store implements features to help out there (a minimum distributed-duplication guarantee for any given piece of data, keeping past versions, etc.).
[2] Also note that not all blob stores are distributed, and many are but support single node operation.
[3] Perhaps we need a new variant of the "RAID is not a backup" mantra. "Distributed storage properties are not, by themselves, a backup" or some such.
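The block-alignment waste mentioned above is easy to quantify: a filesystem allocates whole blocks, so objects smaller than the block size carry padding overhead. A toy calculation with illustrative numbers (2 KB thumbnails on a 4 KB-block filesystem; the figures are made up for the example):

```go
package main

import "fmt"

// allocated returns the on-disk bytes a file of the given size consumes
// when the filesystem allocates whole blocks (ignoring metadata, inline
// storage, and partial-block optimisations discussed above).
func allocated(size, blockSize int) int {
	blocks := (size + blockSize - 1) / blockSize
	if blocks == 0 {
		blocks = 1 // even an empty file typically consumes an allocation
	}
	return blocks * blockSize
}

func main() {
	// A billion 2 KB thumbnails with 4 KB blocks: each file pads its only
	// block, so half the allocated space is waste.
	const n = 1_000_000_000
	used := n * allocated(2048, 4096)
	ideal := n * 2048
	fmt.Printf("%.0f%% overhead\n", 100*float64(used-ideal)/float64(ideal))
}
```

A packed log with an index avoids the per-object padding entirely, which is one reason the specialised stores win at this scale.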
The other commenter already outlined the main trade-offs, which boil down to increased latency and storage overhead for the file-per-object model. As for papers, I like the design of Haystack.
When you had corruption and failures, what was the general procedure to deal with that? I love SeaweedFS and want to try it (Neocities is a nearly perfect use case), but part of my concern is not having a manual/documentation for the edge cases so I can figure things out on the fringes. I didn't see any documentation around that when I last looked but maybe I missed something.
(If any SeaweedFS devs are seeing this, having a section of the wiki that describes failure situations and how to manage them would be a huge add-on.)
This is exactly the problem that the Internet Archive created their Scholar project to mitigate (https://scholar.archive.org/about). The https://fatcat.wiki component acts as a dashboard to track preservation of scholarly publications across multiple efforts. There are a bunch of projects in this area, including LOCKSS ("lots of copies keep stuff safe", including some fun/novel uses of cryptography), Scielo and similar regional platforms and archives (primarily outside the US/EU), Pubmed Central, etc. Zenodo (CERN) and figshare end up being an accessible option for some small journals. There are definitely gaps that content falls through and gets lost.
A few folks have mentioned shadow libraries like Sci-Hub. These efforts can play an archival role, but tend to focus on access, which means there is not as much attention on content which is freely available today, but could disappear in the future.
A common dynamic here is that clout and funding flows to globally prestigious publications, and there is a bias against marginal publications. For sure there are many content-farms and scammy publications, but a lot of gems and valuable small publications get bundled in and dismissed.