Pydantic BaseModel vs. @dataclass for configuration #6015

csmith49 · 2025-01-03T19:20:09Z

What problem or use case are you trying to solve?

I found myself (on #5306) needing to extend the agent configuration. Because some fields needed validation and the configuration was a discriminated union, it felt natural to use a Pydantic BaseModel. However, we don't use BaseModel objects anywhere else for configuration: the agent, app, and LLM configuration are all @dataclass implementations.

(We do already have pydantic as a dependency and seem to use BaseModel objects as drop-in replacements for @dataclass)

Pydantic offers a lot of utility in the BaseModel and surrounding infrastructure, but you maximize that utility when objects are BaseModel implementations all the way down. Instead of mixing @dataclass with BaseModel, it might be best to pick one strategy and commit.

Describe the UX of the solution you'd like

No change needs to happen, but it would be good to get some consensus on when to use BaseModel and when to use @dataclass to help standardize future contributions.

Do you have thoughts on the technical implementation?

I'm biased towards allowing Pydantic in future contributions. Below are three of the config "gotchas" I ran into while working on #5306: they're not large issues, but Pydantic trivially resolves what would otherwise require some extra context to work around.

First: Configuration with structured attributes requires updates to serialization. If I add a new field to LLMConfig that has structure (like the optional draft_editor field), I have to update both to_safe_dict and from_dict to handle the recursive dictionary dumping/loading.

Pydantic BaseModel objects will automatically dump/load fields that are themselves BaseModel instances.
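As a rough illustration of that recursive dump/load, here is a minimal sketch assuming Pydantic v2 (model_dump / model_validate). The classes are hypothetical stand-ins, not the real LLMConfig, and the field names and defaults are made up for the example.

```python
from typing import Optional

from pydantic import BaseModel


class DraftEditorConfig(BaseModel):
    # Hypothetical nested config; fields are illustrative only.
    model: str = "draft-model"
    temperature: float = 0.0


class LLMConfig(BaseModel):
    # Hypothetical, reduced stand-in for the real LLMConfig.
    model: str = "main-model"
    draft_editor: Optional[DraftEditorConfig] = None


config = LLMConfig(draft_editor=DraftEditorConfig(temperature=0.2))

# Nested models are dumped recursively into plain dicts...
dumped = config.model_dump()

# ...and rebuilt (with validation) from plain dicts, with no
# hand-written to_dict/from_dict logic for the nested field.
restored = LLMConfig.model_validate(dumped)
```

Adding another structured field to the sketch would not require touching any serialization code, which is the point being made above.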

Second: Configuration with secret fields has two separate serialization paths. There are the standard dataclass conversions:

from dataclasses import asdict

# Two normal ways to get dict representations
asdict(config)
config.__dict__

These sometimes work just fine but reveal sensitive fields. I know that's why to_safe_dict is there, but on more than one occasion OpenHands code used the standard dataclass conversions when logging, which could have leaked an API key.
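To make the leak concrete, here is a small standard-library sketch. DataclassConfig and its fields are hypothetical and only stand in for a config object with one sensitive field.

```python
from dataclasses import asdict, dataclass


@dataclass
class DataclassConfig:
    # Hypothetical config with one sensitive field (placeholder value).
    model: str = "some-model"
    api_key: str = "sk-not-a-real-key"


config = DataclassConfig()

# Each of these contains the raw key, so logging any of them leaks it.
as_dict = asdict(config)
raw_dict = config.__dict__
as_repr = repr(config)
```

Nothing in the dataclass machinery distinguishes api_key from any other field, so the safe path only works if every caller remembers to use it.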

Pydantic offers the SecretStr type, which when used as a field annotation ensures that all representations of that field are starred out (***********) unless a config.field.get_secret_value() method is explicitly called.
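A minimal sketch of the SecretStr behavior, assuming Pydantic v2. SecretConfig and its fields are hypothetical, and the key is a placeholder value.

```python
from pydantic import BaseModel, SecretStr


class SecretConfig(BaseModel):
    # Hypothetical config; api_key is masked in every default representation.
    model: str = "some-model"
    api_key: SecretStr = SecretStr("sk-not-a-real-key")


config = SecretConfig()

# repr, dict dumps, and JSON dumps all show a masked placeholder
# instead of the raw value.
masked_repr = repr(config)
dumped = config.model_dump()
json_dump = config.model_dump_json()

# The raw value is only available through an explicit call.
revealed = config.api_key.get_secret_value()
```

The explicit get_secret_value() call also makes every place that touches the raw key easy to find with a text search.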

Third: Configuration assumes uniform fields. In my PR I added several implementations of a Condenser base class that all needed to be configured and had different but overlapping configuration options, as in:

from abc import ABC
from dataclasses import dataclass


class Condenser(ABC):
    ...

@dataclass
class CondenserA(Condenser):
    x: int = 0
    y: int = 0

@dataclass
class CondenserB(Condenser):
    x: int = 0
    y: int = 0

@dataclass
class CondenserC(Condenser):
    x: int = 1
    z: str = ""

Assuming I need to pass the fields x, y, and z to the condensers to initialize them, there are a few strategies for managing the configuration:

1. Make a big CondenserConfig object that has optional fields for x, y, and z and some logic for building the Condenser{A, B, C} instances from the big object. Made tricky by the overlap and inconsistent defaults, and pushes validation down to condenser initialization.
2. Make smaller config objects -- one for each condenser implementation -- that subclass from the big CondenserConfig object. Need to add some extra structure for unambiguous serialization: just looking at the fields isn't enough to distinguish between CondenserA and CondenserB.
3. Use Pydantic's support for discriminated unions to automatically provide the extra structure mentioned in 2.
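The discriminated-union strategy might look roughly like the following sketch, assuming Pydantic v2. The *Config class names, the "type" tag field, its values, and AgentConfig are all hypothetical choices for illustration, not necessarily what #5306 uses.

```python
from typing import Annotated, Literal, Union

from pydantic import BaseModel, Field


class CondenserAConfig(BaseModel):
    type: Literal["a"] = "a"  # tag field used as the discriminator
    x: int = 0
    y: int = 0


class CondenserBConfig(BaseModel):
    type: Literal["b"] = "b"
    x: int = 0
    y: int = 0


class CondenserCConfig(BaseModel):
    type: Literal["c"] = "c"
    x: int = 1
    z: str = ""


# The union is resolved by looking at the "type" field of the input.
CondenserConfig = Annotated[
    Union[CondenserAConfig, CondenserBConfig, CondenserCConfig],
    Field(discriminator="type"),
]


class AgentConfig(BaseModel):
    # Hypothetical parent config holding exactly one condenser config.
    condenser: CondenserConfig


# A and B have identical fields; the tag alone selects the right class.
config_b = AgentConfig.model_validate({"condenser": {"type": "b", "x": 3}})

# Each variant keeps its own defaults (x defaults to 1 for C).
config_c = AgentConfig.model_validate({"condenser": {"type": "c"}})

# The tag is included in dumps, so round-tripping is unambiguous.
round_tripped = AgentConfig.model_validate(config_b.model_dump())
```

The tag field is the "extra structure" from option 2, but Pydantic handles the dispatch and per-variant validation instead of custom loading code.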


enyst · 2025-01-03T19:46:12Z

Previous discussion: #5306 (comment)

neubig · 2025-01-04T04:18:29Z

One vote from me for unifying on pydantic!

csmith49 · 2025-01-09T19:46:22Z

Took a crack at an implementation in #6176 if y'all want to provide comments or feedback there.

csmith49 added the enhancement New feature or request label Jan 3, 2025

csmith49 mentioned this issue Jan 3, 2025

feature: Condenser Interface and Defaults #5306

Merged

1 task

csmith49 mentioned this issue Jan 9, 2025

Config objects as Pydantic BaseModels #6176

Merged

1 task

neubig closed this as completed in #6176 Jan 12, 2025

csmith49 mentioned this issue Jan 16, 2025

Pydantic-based configuration and setting objects #6321

Merged

1 task
