Schema Support
Overview
Conduit can manage the structure and format of data as it moves through the pipeline. This makes it possible to take advantage of the benefits that the associated type information provides, such as:
- Data Integrity: Ensuring that data adheres to the expected structure, reducing the risk of errors and inconsistencies.
- Type Safety: Retaining type information throughout the data pipeline, allowing for safe and accurate data processing.
- Future-Proofing: Preparing the system to handle evolving data structures, making it easier to adapt to changes without significant disruptions.
Additionally, this solves the problem of transferring type information in standalone connectors. Namely, Conduit and standalone connectors communicate via Protobuf messages that have a limited set of types.
Every record that contains structured data can be associated with a schema. The Apache Avro™ format is supported and support for more schema types is planned.
Since both, a record's key and payload, can be structured, schemas can be associated with either. Schemas are not part of a record for performance reasons. Instead, a record's metadata contains information about the key schema's subject and version as well the payload schema's subject and version. The schemas themselves are managed by the schema registry.
Conduit's Connector SDK and Processor SDK make it possible to:
- automatically extract a schema from a record's key or payload (note that the data has to be structured)
- automatically encode or decode a record's key or payload (i.e. connectors and processors can work with structured data that contain correct types without being involved in fetching the schemas and encoding/decoding the data themselves)
- work directly with the schema registry (for example, in cases where automatic schema extraction isn't enough and a schema needs to be built manually using the information from a source)
When developing / testing your connector, the schemas will be created in memory by the sdk, so that you can simulate interacting with conduit without having to worry about starting up a conduit instance. When using sdk.Serve the connector will use the grpc API instead. The schema fetches are cached, but if by any chance you want to customize that behaviour you can override the schema.Service variable.
More information about how to work with schemas can be found in the relevant pages for source connectors, destination connectors and processors.
To learn more about configuring source and destination connectors to automatically extract the schema from the key and payload of a record, check out the schema extraction configuration parameters.
Schema Registry
Conduit uses a schema registry to store the schemas of records. You can either configure Conduit to use the built-in schema registry (default) or to connect to an external standalone service exposing a REST API that's compatible with Confluent's Schema registry.
To use an external schema registry, you can use the following configuration options in your Conduit configuration file.
schema-registry:
type: confluent
confluent:
connection-string: http://localhost:8081
Or when running Conduit from the command line you can use the following flags:
$ ./conduit
-schema-registry.type=confluent
-schema-registry.confluent.connection-string=http://localhost:8081
When using a built-in schema registry, schemas will be stored in the same persistence layer as the rest of Conduit's data. By default, it uses BadgerDB, but it can be configured to use PostgreSQL, SQLite, or in-memory storage. More information on storage.
You can check the GitHub repository for the schema registry here.