
Question: Are the unused parts of a JSON schema automatically removed? #243

Closed
fungiboletus opened this issue Apr 25, 2019 · 7 comments

@fungiboletus

Hi,

I have a fairly large JSON schema (~3 MB). For various reasons, I modify the schema at runtime before loading it with gojsonschema.NewGoLoader. My changes are not particularly delicate, and large parts of the schema are no longer used afterwards.

At the same time, the memory usage of the validators created by gojsonschema is very high, and I would like to reduce it.

I could try to edit my JSON schema more carefully before loading it to reduce the memory usage, but perhaps gojsonschema already does something like that, so I don't have to. Is that the case? If not, do you think it would be easy and realistic for a third-party contributor to update this library so that it automatically removes/frees the objects created for the parts of a JSON schema that are never used?

@johandorland
Collaborator

The entire schema is parsed, because beforehand there is no way of knowing which parts of the schema will and will not be used, as a $ref can turn up anywhere and can point to anywhere. If the unused parts of your schema are in places like "definitions", however, you should be safe. Nothing is explicitly removed; it's just not included. But if you have a schema like:

{
  "if": false,
  "then": {},
  "else": {}
}

then you're out of luck. You might think you could cull the "then" branch, but once you take the "$ref" mechanics into account it becomes such a can of worms that schema culling is almost impossible and just not worth the effort.
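To make that "$ref" point concrete, here is a minimal sketch (the schema and instance are invented for illustration, and it assumes the library accepts boolean subschemas as draft-06+ allows): even though the top-level "if" is statically false, the "then" branch is still reachable through a JSON pointer, so dropping it would change validation results.

package main

import (
	"fmt"

	"github.com/xeipuuv/gojsonschema"
)

func main() {
	// "then" looks dead because "if" is false, yet "properties.x" still
	// reaches it via "#/then", so it cannot simply be culled.
	schemaJSON := `{
		"if": false,
		"then": {"type": "integer"},
		"properties": {"x": {"$ref": "#/then"}}
	}`
	schema, err := gojsonschema.NewSchema(gojsonschema.NewStringLoader(schemaJSON))
	if err != nil {
		panic(err)
	}
	result, err := schema.Validate(gojsonschema.NewStringLoader(`{"x": "not an integer"}`))
	if err != nil {
		panic(err)
	}
	fmt.Println(result.Valid()) // expected: false, because "#/then" was applied to "x" via the $ref
}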

The thing with "definitions", however, is that by spec every entry in it is required to be a valid schema. Gojsonschema will parse every subschema in a "definitions" block and check whether it's a valid schema, only to throw the resulting schema away later. Because of how the referencing mechanics work, "definitions" is not a special case, and when a "$ref" points inside it the schema will actually be reparsed, even if that had already been done earlier 😬 This double work is inefficient and might lead to increased memory usage if the garbage collector doesn't free the discarded schemas fast enough.

A different issue that will lead to increased memory usage is the fact that you use a GoLoader. First the Go struct is converted to JSON, then that JSON is parsed internally into a map[string]interface{}, and from that the final schema is created. This means your 3 MB schema will be present at least 4 times in memory: the raw Go struct, the JSON created from the Go struct, the internal map[string]interface{} parsed from that JSON, and the final gojsonschema.Schema.
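As a rough illustration of that chain (a sketch only, not the library's internals; compileSchema is a hypothetical helper): if the modified schema is marshaled to JSON explicitly and compiled from the bytes with NewBytesLoader, the original Go value no longer has to stay alive alongside the other copies while the schema is built.

import (
	"encoding/json"

	"github.com/xeipuuv/gojsonschema"
)

// compileSchema is a hypothetical helper: marshal the modified schema once,
// then compile from the JSON bytes so the caller can drop the Go value early.
func compileSchema(data interface{}) (*gojsonschema.Schema, error) {
	raw, err := json.Marshal(data) // one explicit JSON copy of the schema
	if err != nil {
		return nil, err
	}
	// Only raw is needed from here on; once the caller releases data,
	// that copy becomes garbage collectable.
	return gojsonschema.NewSchemaLoader().Compile(gojsonschema.NewBytesLoader(raw))
}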

There are probably a couple of areas that could be made more memory efficient with general Go optimizations, which anyone with Go knowledge should be able to improve. A smart optimizer that dynamically removes parts of the schema, like you suggested, is just about impossible because of the way JSON Schema works.

@fungiboletus
Author

fungiboletus commented Apr 25, 2019

Thanks for the explanation. I underestimated the difficulty of solving the generic problem.

After reading about how it works, I will try to simplify my model before parsing it. It contains about 150 subschemas in its "definitions" but more than 10,000 "$ref": "#/definitions/name" references… This must stress the garbage collector a little bit. It's the FHIR JSON schema, if you are curious about it, but I'm planning not to always support the contained property.

About the memory usage of the GoLoader, I have the following code:

goLoader := gojsonschema.NewGoLoader(data)
schemaLoader := gojsonschema.NewSchemaLoader()
schema, err := schemaLoader.Compile(goLoader)
if err != nil {
	return err
}

someMap[name] = schema

return nil

Do you think the gojsonschema.Schema keeps references to the intermediate structures, or should it eventually be the only structure left in memory?

@johandorland
Collaborator

Actually, the 10,000 identical $refs should not be the problem. The first time "$ref": "#/definitions/name" is encountered it will cause #/definitions/name to be parsed; the second ref to #/definitions/name will just reuse the schema from the first time.

The gojsonschema.Schema is definitely not the only reference, as you will always have the original data that you put into the goLoader (goLoader itself doesn't deep copy the data, so that part should be fine). The JSON might be garbage collectable after the schema is created, but the map[string]interface{} stuff definitely lingers around for a couple of reasons. That JSON is only ~3 MB however, so it's not going to be the problem.

The biggest memory hog is probably the gojsonschema.Schema itself. This is how a JSON schema is represented internally: if you assume every field in that struct is a pointer, you have 42 pointers, which is 336 bytes per subschema. That's pretty inefficient, and considering the sheer number of subschemas contained within your schema, you should probably multiply that 3 MB several times over to get the size of the final gojsonschema.Schema.

I just ran a quick test to see how much memory a program uses that does nothing but load your 3 MB schema. It uses ~35 MB of memory before running runtime.GC() and ~21 MB after, which is about 7 times the size of the original schema. I ran the same test with a different 4 MB schema and there the numbers were ~35 MB before GC and ~30 MB after.
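For reference, a measurement like that could look something like the following sketch (not the exact test that was run; the schema path is a placeholder):

package main

import (
	"fmt"
	"runtime"

	"github.com/xeipuuv/gojsonschema"
)

// heapMB returns the current heap allocation in megabytes.
func heapMB() float64 {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return float64(m.HeapAlloc) / (1 << 20)
}

func main() {
	// Placeholder path: point this at the ~3 MB schema under test.
	loader := gojsonschema.NewReferenceLoader("file:///path/to/schema.json")
	schema, err := gojsonschema.NewSchema(loader)
	if err != nil {
		panic(err)
	}
	fmt.Printf("before GC: %.1f MB\n", heapMB())
	runtime.GC()
	fmt.Printf("after GC:  %.1f MB\n", heapMB())
	_ = schema // keep the schema reachable so its memory is included
}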

Is that in the same order of magnitude you are experiencing?

@fungiboletus
Author

fungiboletus commented Apr 26, 2019

Good to know that the definitions are not parsed each time they are encountered. I realised that it makes sense to parse them only once, since there can be circular references.

I do indeed see this order of magnitude when I load the schema once. However, since "discriminator" produces fuzzy errors, I decided some time ago to remove it and to create a different gojsonschema.Schema for each type used in the discriminator. So the memory usage is quite a lot higher with all my duplicated schemas 😅

~30 MB of memory usage is actually fine for me, but the more types I support, the higher the memory usage gets. Perhaps I could use something other than "discriminator" (is it standard? I can't really find documentation about it) while still loading the schema only once.

@johandorland
Collaborator

"discriminator" is an OpenAPI thing, this library only does JSON schema, the JSON schema equivalent seems to be "oneOf" in this case. The schema uses a couple of large "oneOf", which are notoriously bad for getting useful error messages out of as noted in issue #214. If you want to solve that problem yourself you can create ~150 schemas from scratch, but then you shouldn't be surprised you run into memory problems 😆 .

If you really want to go down that road, the correct approach would be to load the schema into a SchemaLoader once and then compile the 150 schemas from that one SchemaLoader. Currently schemas are not cached between different compilations, but if you modified the Compile function here in such a way that the reference pool is shared between multiple invocations of Compile, and Compile directly returns cached schemas, the end result would be quite efficient.

Then your final code would look something like:

jsonLoader := gojsonschema.NewReferenceLoader("file:///path/to/schema/or/w/e")
schemaLoader := gojsonschema.NewSchemaLoader()
err := schemaLoader.AddSchemas(jsonLoader)
if err != nil {
	return err
}

// range over all the properties in definitions
// (definitionNames is a hypothetical slice holding the keys of the "definitions" block)
for _, name := range definitionNames {
	schema, err := schemaLoader.Compile(gojsonschema.NewReferenceLoader("http://hl7.org/fhir/json-schema/4.0#/definitions/" + name))
	if err != nil {
		return err
	}
	someMap[name] = schema
}

@fungiboletus
Author

Yes, I decided to create the many schemas because of #214. I simply didn't expect the memory usage to be that high 😄

Thank you very much for your long comment and the suggestion! I will try to implement that soon.

@fungiboletus
Author

fungiboletus commented May 30, 2019

So I implemented the changes and it works great!

In short, it went from 1.5 seconds and 680 MB of memory allocations to 0.1 s and 40 MB of memory allocations (before the garbage collector runs).

Optimization                            |      ns/op |      B/op | allocs/op
----------------------------------------|------------|-----------|----------
No optimization                         | 1692131290 | 677610484 |   7422829
Reference Pool Singleton Optimization   | 1328775280 | 533190852 |   5780825
AddSchemas+ReferenceLoader Optimization |  529403585 | 196413161 |   2083911
Both Optimizations                      |   98896308 |  38946589 |    355560

I decided to make the schemaReferencePool a singleton, because that keeps the patch quite small to maintain.

schemaReferencePool.go's patched newSchemaReferencePool

var schemaReferencePoolSingleton *schemaReferencePool

func newSchemaReferencePool() *schemaReferencePool {

	if schemaReferencePoolSingleton == nil {
		schemaReferencePoolSingleton = &schemaReferencePool{}
		schemaReferencePoolSingleton.documents = make(map[string]*subSchema)
	}

	return schemaReferencePoolSingleton
}
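One note on the design choice: the lazy singleton above is not goroutine-safe if schemas are compiled from multiple goroutines at the same time. A minimal sketch of the same idea guarded with sync.Once (this assumes "sync" is added to the file's imports; the pool's map would still need its own locking for fully concurrent use):

var (
	schemaReferencePoolOnce      sync.Once
	schemaReferencePoolSingleton *schemaReferencePool
)

func newSchemaReferencePool() *schemaReferencePool {
	// Initialize the shared pool exactly once, even under concurrent calls.
	schemaReferencePoolOnce.Do(func() {
		schemaReferencePoolSingleton = &schemaReferencePool{
			documents: make(map[string]*subSchema),
		}
	})
	return schemaReferencePoolSingleton
}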
