Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pydantic BaseModel vs. @dataclass for configuration #6015

Closed
csmith49 opened this issue Jan 3, 2025 · 3 comments · Fixed by #6176
Closed

Pydantic BaseModel vs. @dataclass for configuration #6015

csmith49 opened this issue Jan 3, 2025 · 3 comments · Fixed by #6176
Labels
enhancement New feature or request

Comments

@csmith49
Copy link
Collaborator

csmith49 commented Jan 3, 2025

What problem or use case are you trying to solve?

I found myself (on #5306) needing to extend the agent configuration. Because some fields needed validation and the configuration was a discriminated union, it felt natural to use a Pydantic BaseModel. However, we don't use BaseModel objects anywhere else for configuration: the agent, app, and LLM configuration are all @dataclass implementations.

(We do already have pydantic as a dependency and seem to use BaseModel objects as drop-in replacements for @dataclass)

Pydantic offers a lot of utility in the BaseModel and surrounding infrastructure, but you maximize that utility when objects are BaseModel implementations all the way down. Instead of mixing @dataclass with BaseModel, it might be best to pick one strategy and commit.

Describe the UX of the solution you'd like

No change needs to happen, but it would be good to get some consensus on when to use BaseModel and when to use @dataclass to help standardize future contributions.

Do you have thoughts on the technical implementation?

I'm biased towards allowing Pydantic in future contributions. Below are three of the config "gotchas" I ran into while working on #5306: they're not large issues, but Pydantic trivially resolves what would otherwise require some extra context to work around.

First: Configuration with structured attributes require updates to serialization. If I add a new field to LLMConfig that has structure (like the optional draft_editor field), I have to update the to_safe_dict and from_dict to handle the recursive dictionary dumping/loading.

Pydantic BaseModel objects will automatically dump/load fields that are themselves BaseModel instances.

Second: Configuration with secret fields has two separate serialization paths. There's the standard data class conversions:

from dataclasses import asdict

# Two normal ways to get dict representations
asdict(config)
config.__dict__

Which sometimes work just fine but reveal sensitive fields. I know that's why to_safe_dict is there, but on more than one instance OpenHands wrote some code that used the standard data class conversions when logging which could have leaked an API key.

Pydantic offers the SecretStr type, which when used as a field annotation ensures that all representations of that field are starred out (***********) unless a config.field.get_secret_value() method is explicitly called.

Third: Configuration assumes uniform fields. In my PR I added several implementations of a Condenser base class that all needed to be configured and had different but overlapping configuration options, as in:

class Condenser(ABC):
    ...

@dataclass
class CondenserA(Condenser):
    x: int = 0
    y: int = 0
    
@dataclass
class CondenserB(Condenser):
    x: int = 0
    y: int = 0

@dataclass
class CondenserC(Condenser):
    x: int = 1
    z: str = ""

Assuming I need to pass the fields x, y, and z, to the condensers to initialize them, there are a few strategies for managing the configuration:

  1. Make a big CondenserConfig object that has optional fields for x, y, and z and some logic for building the Condenser{A, B, C} instances from the big object. Made tricky by the overlap and inconsistent defaults, and pushes validation down to condenser initialization.
  2. Make smaller config objects -- one for each condenser implementation -- that subclass from the big CondenserConfig object. Need to add some extra structure for unambiguous serialization: just looking at the fields isn't enough to distinguish between CondenserA and CondenserB.
  3. Use Pydantic's support for discriminated unions to automatically provide the extra structure mentioned in 2.
@csmith49 csmith49 added the enhancement New feature or request label Jan 3, 2025
@enyst
Copy link
Collaborator

enyst commented Jan 3, 2025

Previous discussion: #5306 (comment)

@neubig
Copy link
Contributor

neubig commented Jan 4, 2025

One vote from me for unifying on pydantic!

@csmith49
Copy link
Collaborator Author

csmith49 commented Jan 9, 2025

Took a crack at an implementation in #6176 if y'all want to provide comments or feedback there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants