Data Synchronization
One of the main design goals of atproto (the "Authenticated Transfer Protocol") is to reliably distribute public content between independent network services. This data transfer should be trustworthy (cryptographicly authenticated) and relatively low-latency even at large scale. It is also important that new participants can join the network at any time and “backfill” prior content.
This section describes the major data synchronization features in atproto. The primary real-time data synchronization mechanism is repository event streams, commonly referred to as "firehoses". The primary batch data transfer mechanism is repository exports as CAR files. These two mechanisms can be combined in a "bootstrapping" process which result in a live-synchronized copy of the network.
Synchronization Primitives
As described in the repository spec, each commit to a repository has a revision number, in TID syntax. The revision number must always increase between commits for the same account, even if the account is migrated between hosts or has a period of inactivity in the network. Revision numbers can be used as a logical clock to aid synchronization of individual accounts. To keep this simple, it is recommended to use the current time as a TID for each commit, including the initial commit when creating a new account. Services should reject or ignore revision numbers corresponding to future timestamps (beyond a short fuzzy time drift window). Network services can track the commit revision for every account they have seen, and use this to verify synchronization progress. Services which synchronize data can include the most-recently-processed revision in HTTP responses to API requests from the account in question, in the Atproto-Repo-Rev
HTTP response header. This allows clients (and users) to detect if the response is up-to-date with the actual repository, and detect any problems with synchronization.
Firehose
The repository event stream (com.atproto.sync.subscribeRepos
, also called the "firehose") is an Event Stream which broadcasts updates to repositories (#commit
events), handles and DID documents (#identity
), and account hosting status (#account
). PDS hosts provide a single stream with updates for all locally-hosted accounts. "Relays" are network services which subscribe to one or more repo streams (eg, multiple PDS instances) and aggregate them in to a single combined repo stream. The combined stream has the same structure and event types. A Relay which aggregates nearly all accounts from nearly all PDS instances in the network (possibly through intermediate relays) outputs a “full-network” firehose. Relays often mirror and can re-distribute the repository contents, though their core functionality is to verify content and output a unified firehose.
In most cases the repository data synchronized over the firehose is self-certifying (contains verifiable signatures), and consumers can verify content without making additional requests directly to account PDS instances. It is possible for services to redact events from the firehose, such that downstream services would not be aware of new content.
Identity and account information is not self-certifying, and services may need need to verify independently. This usually means independent DID and handle resolution. Account hosting status might also be checked at account PDS hosts, to disambiguate hosting status at different pieces of infrastructure.
The event message types are declared in the com.atproto.sync.subscribeRepos
Lexicon schema, and are summarized below. A few fields are the same for all event types (except for repo
vs did
for #commit
events):
seq
(integer, required): used to ensure reliable consumption, as described in Event Streamsdid
/repo
(string with DID syntax, required): the account/identity associated with the eventtime
(string with datetime syntax, required): an informal and non-authoritative estimate of when event was received. Intermediate services may decide to pass this field through as-is, or update to the current time
#identity
Events
Indicates that there may have been a change to the indicated identity (meaning the DID document or handle), and optionally what the current handle is. Does not indicate what changed, or reliably indicate what the current state of the identity is.
Event fields:
seq
(integer, required): same for all event typesdid
(string with DID syntax, required): same for all event typestime
(string with datetime syntax, required): same for all event typeshandle
(string with handle syntax, optional): the current handle for this identity. May behandle.invalid
if the handle does not currently resolve correctly.
Presence or absence of the handle
field does not indicate that it is the handle which has changed.
The semantics and expected behavior are that downstream services should update any cached identity metadata (including DID document and handle) for the indicated DID. They might mark caches as stale, immediately purge cached data, or attempt to re-resolve metadata.
Identity events are emitted on a "best-effort" basis. It is possible for the DID document or handle resolution status to change without any atproto service detecting the change, in which case an event would not be emitted. It is also possible for the event to be emitted redundantly, when nothing has actually changed.
Intermediate services (eg, relays) may chose to modify or pass through identity events:
- they may replace the handle with the result of their own resolution; or always remove the handle field; or always pass it through unaltered
- they may filter out identity events if they observe that identity has not actually changed
- they may emit identity events based on changes they became aware of independently (eg, via periodic re-validation of handles)
#account
Events
Indicates that there may have been a change in Account Hosting status at the service which emits the event, and what the new status is. For example, it could be the result of creation, deletion, or temporary suspension of an account. The event describes the current hosting status, not what changed.
Event Fields:
seq
(integer, required): same for all event typesdid
(string with DID syntax, required): same for all event typestime
(string with datetime syntax, required): same for all event typesactive
(boolean, required): whether the repository is currently available and can be redistributedstatus
(string, optional): string status code which describes the account state in more detail. Known values include:takendown
: indefinite removal of the repository by a service provider, due to a terms or policy violationsuspended
: temporary or time-limited variant oftakedown
deleted
: account has been deactivated, possibly permanently.deactivated
: temporary or indefinite removal of all public data by the account themselves.
When coming from any service which redistributes account data, the event describes what the new status is at that service, and is authoritative in that context. In other words, the event is hop-by-hop for repository hosts and mirrors.
See the Account Hosting specification for more details.
#commit
Events
This event indicates that there has been a new repository commit for the indicated account. The event usually contains the "diff" of repository data, in the form of a CAR slice. See the Repository specification for details on "diffs" and the CAR file format.
See the Repository specification for more details around repo diffs.
Event Fields:
seq
(integer, required): same for all event typesrepo
(string with DID syntax, required): the same asdid
for all other event typestime
(string with datetime syntax, required): same for all event typesrev
(string with TID syntax, required): the revision of the commit. Must match therev
in the commit block itself.since
(string with TID syntax, nullable): indicates therev
of a preceding commit, which the the repo diff contains differences fromcommit
(cid-link, required): CID of the commit object (inblocks
)tooBig
(boolean, required): if true, indicates that the repo diff was too large, and thatblocks
,ops
, and completeblobs
are not all includedblocks
(bytes, required): CAR "slice" for the corresponding repo diff. The commit object must always be included.ops
(array of objects, required): list of record-level operations in this commit: specific records created, updated, deletedblobs
(array of cid-link, required): set of new blobs (by CID) referenced by records in this commit
Commit events are broadcast when the account repository changes. Commits can be "empty", meaning no actual record content changed, and only the rev
was incremented. They can contain a single record update, or multiple updates. Only the commit object, record blocks, and MST tree nodes are authenticated (signed): the since
, ops
, blobs
, and tooBig
fields are not self-certifying, and could in theory be manipulated, or otherwise be incorrect or incomplete.
If since
is not included, the commit should include the full repo tree, or set the tooBig
flag.
If the tooBig
flag is set, then the amount of updated data was too much to be serialized in a single stream event message. Downstream services which want to maintain complete synchronized copies for the repo need to fetch the diff separately, as discussed below.
Firehose Validation Best Practices
A service which does full validation of upstream events has a number of properties to track and check. For example, Relay instances should fully validate content from PDS instances before re-broadcasting.
Here is a summary of validation rules and behaviors:
- services should independently resolve identity data for each DID. They should ignore
#commit
events for accounts which do not have a functioning atproto identity (eg, lacking a signing key, or lacking a PDS service entry, or for which the DID has been tombstoned) - services which subscribe directly to PDS instances should keep track of which PDS is authoritative for each DID. They should remember the host each subscription (WebSocket) is connected to, and reject
#commit
events for accounts if they come from a stream which does not correspond to the current account for that DID - services should track account hosting status for each DID, and ignore
#commit
events for events which are notactive
- services should verify commit signatures for each
#commit
event, using the current identity data. If the signature initially fails to verify, the service should refresh the identity metadata in case it had recently changed. Events with conclusively invalid signatures should be rejected. - services should reject any event messages which exceed reasonable limits. A reasonable upper bound for producers is 5 MBytes (for any event stream message type). The
subscribeRepos
Lexicon also limitsblocks
to one million bytes, andops
to 200 entries. Commits with too much data must use thetooBig
mechanism, though such commits should generally be avoided in the first place by breaking them up in to multiple smaller commits. - services should verify that repository data structures are valid against the specification. Missing fields, incorrect MST structure, or other protocol-layer violations should result in events being rejected.
- services may apply rate-limits to identity, account, and commit events, and suspend accounts or upstream services which violate those limits. Rate limits might also be applied to recovery modes such as invalid signatures resulting in an identity refresh,
tooBig
events, missing or out-of-order commits, etc. - services should ignore commit events with a
rev
lower or equal to the most recent processedrev
for that DID, and should reject commit events with arev
corresponding to a future timestamp (beyond a clock drift window of a few minutes) - services should check the
since
value in commit events, and if it is not consistent with the previous seenrev
for that DID (see discussion in "Reliable Synchronization"), mark the repo as out-of-sync (similar to atooBig
commit event) - data limits on records specifically should be verified. Events containing corrupt or entirely invalid records may be rejected. for example, a record not being CBOR at all, or exceeding normal data size limits.
- more subtle data validation of records may be enforced, or may be ignored, depending on the service. For example, unsupported CID hash types embedded in records should probably be ignored by Relays (even if they violate the atproto data model), but may result in the record or commit event being rejected by an AppView
- mirroring services, which retain a full copy of repository data, should verify that commit diffs leave the MST tree in a complete and valid state (eg, no missing records, no invalid MST nodes, commit CID would be reproducible if the MST structure was re-generated from scratch)
- Relays (specifically) should not validate records against Lexicons
Reliable Synchronization
This section describes some details on how to reliably subscribe to the firehose and maintain an existing synchronized mirror of the network.
Services should generally maintain a few pieces of state for all accounts they are tracking data from:
- track the most recent commit
rev
they have successfully processed - keep cached identity data, and use cache expiration to ensure periodic re-validation of that data
- track account status
Identity caches should be purged any time an #identity
event is received. Additionally, identity resolution should be refreshed if a commit signature fails to verify, in case the signing key was updated but the identity cache has not been updated yet.
When tooBig
events are emitted on the firehose, downstream services will need to fetch the diff out-of-band. This usually means an API request to the com.atproto.sync.getRepo
endpoint on the current PDS host for the account, with the since
field included. The since
value should be the most recently processed rev
value for the account, which may or may not match the since
field in the commit event message.
If a #commit
is received with a since
that does not match the most recently processed rev
for the account, and is “later” (higher value) than the most recent commit rev
the service has processed for that account, the service may need to do the same kind of out-of-band fetch as for a tooBig
event.
Services should keep track of the seq
number of their upstream subscriptions. This should be stored separately per-upstream, even if there is only a single Relay connection, in case a different Relay is subscribed to in the future (which will have different seq
numbers).
Events can be processed concurrently, but they should be processed sequentially in-order for any given account. This can be accomplished by partitioning multiple workers using the repo DID as a partitioning key.
Services can confirm that they are consuming content reliably by fetching a snapshot of repository DIDs and rev
numbers from other services, including PDS hosts and Relay instances. After a short delay, these can be compared against the current state of the service to identify any accounts which have lower than expected rev
numbers. These repos can then be updated out-of-band.
Bootstrapping a Live Mirror
The firehose can be used to follow new data updates, and repo exports can be used for snapshots. Actually combining the two to bootstrap a complete live-updating mirror can be a bit tricky. One approach is described below.
Keep a sync status table for all accounts (DIDs) encountered. The status can be:
dirty
: there is either no local repo data for this account, or it has gotten out of syncin-process
: the repo is "dirty", but there is a background task in process to update itsynchronized
: a complete copy of the repository has been processed
Start by subscribing to the full firehose. If there is no existing repository data for the account, mark the account as "dirty". When new events come in for a repo, the behavior depends on the repo state. If it is "dirty", the event is ignored. If the state is "synchronized", the event is immediately processed as an update to the repo. If the state is "in-process", the event is enqueued locally.
Have a set of background workers start processing "dirty" repos. First they mark the status as in-process
, so that new events are enqueued locally. Then the full repo export (CAR file) is fetched from the PDS and processed in full. The commit rev
of the repo export is noted. When the full repo import is complete, the worker can start processing any enqueued events, in order, skipping any with a rev
lower than the existing repo processed rev
(as is the usual behavior). When the queue for the account is fully processed, the state can be flipped to synchronized
, and the worker can move on.
After some time, most of the known accounts will be marked as synchronized
, though this will only represent the most recently active accounts in the network. Next a more complete set of repositories in the network can be fetched, for example using an API query against an existing large service. Any new identified accounts can be marked as dirty
in the service, and the background workers can start processing them.
When all of the accounts are synchronized
, the process is complete. At large scale it may be hard to get perfect synchronization: PDS instances may be down at various times, identities may fail to resolve, or invalid events, data, or signatures may end up in the network.
Security Concerns
General mitigations for resource exhaustion attacks are recommended: event rate-limits, data quotas per account, limits on data object sizes and deserialized data complexity, etc.
Care should always be taken when making network requests to unknown or untrusted hosts, especially when the network locators for those host from from untrusted input. This includes validating URLs to not connect to local or internal hosts (including via HTTP redirects), avoiding SSRF in browser contexts, etc.
To prevent traffic amplification attacks, outbound network requests should be rate-limited by host. For example, identity resolution requests when consuming from the firehose, including DNS TXT traffic volume and DID resolution requests.
Future Work
The subscribeRepos
Lexicon is likely to be tweaked, with deprecated fields removed, even if this breaks Lexicon evolution rules.
The event stream sequence/cursor scheme may be iterated on to support sharding, timestamp-based resumption, and easier failover between independent instances.
Alternatives to the full authenticated firehose may be added to the protocol. For example, simple JSON serialization, filtering by record collection type, omitting MST nodes, and other changes which would simplify development and reduce resource consumption for use-cases where full authentication is not necessary or desired.