Create a geoparquet module and add dependencies #855

Merged
merged 4 commits into from
May 30, 2024

bchapuis
Member

No description provided.

Contributor

@Drabble Drabble left a comment

Thank you for creating this draft so quickly. I just added 2 small questions to the PR.

baremaps-geoparquet/pom.xml (resolved)
pom.xml (resolved)
@bchapuis
Member Author

Previous attempt #851

@bchapuis bchapuis force-pushed the 849-geoparquet branch 5 times, most recently from 2095d8d to 0c476c6 Compare May 23, 2024 06:28
@bchapuis
Member Author

@sebr72 I just pushed the GeoParquetGroup interface, which I think would be a decent abstraction for accessing the records of a GeoParquet file. It contains getters and setters for the main GeoParquet types. It also describes an augmented schema on top of the GeoParquet schema, using a sealed GeoParquetGroup.Schema interface that should make introspection easier for users of the API. It would be nice to have your feedback.

In terms of organisation, once this interface is stabilized, I can work on a PostGIS importer with mocks and you can work on the GeoParquet reader internals. Is that OK for you?

@Drabble It would be nice to have your feedback as well.
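
To make the idea of a sealed schema interface more concrete, here is a minimal sketch of how a sealed hierarchy can support introspection. It is not the GeoParquetGroup.Schema interface from this PR: every name below (`SchemaSketch`, `Field`, `GroupField`, and so on) is hypothetical, and the sketch assumes Java 17 or later.

```java
import java.util.List;

// Hypothetical sketch: none of these names come from the PR.
final class SchemaSketch {

  // A sealed hierarchy lists every kind of field a caller may encounter.
  sealed interface Field permits BinaryField, DoubleField, GeometryField, GroupField {
    String name();
  }

  record BinaryField(String name) implements Field {}

  record DoubleField(String name) implements Field {}

  record GeometryField(String name) implements Field {}

  record GroupField(String name, List<Field> fields) implements Field {}

  // Introspection: walk the schema recursively and describe each field.
  static String describe(Field field) {
    if (field instanceof GroupField group) {
      StringBuilder sb = new StringBuilder(group.name()).append("{");
      for (int i = 0; i < group.fields().size(); i++) {
        if (i > 0) {
          sb.append(", ");
        }
        sb.append(describe(group.fields().get(i)));
      }
      return sb.append("}").toString();
    }
    if (field instanceof GeometryField) {
      return field.name() + ":geometry";
    }
    if (field instanceof DoubleField) {
      return field.name() + ":double";
    }
    return field.name() + ":binary";
  }

  public static void main(String[] args) {
    Field schema = new GroupField("root",
        List.of(new GeometryField("geometry"), new DoubleField("height")));
    System.out.println(describe(schema));
  }
}
```

Because the set of permitted subtypes is closed, a user of such an API can handle each field kind explicitly without guessing which implementations exist.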

pom.xml (resolved)
configuration.set("fs.s3a.aws.credentials.provider",
"org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider");
configuration.setBoolean("fs.s3a.path.style.access", true);
configuration.setBoolean(AvroReadSupport.READ_INT96_AS_FIXED, true);
Contributor

@Drabble Drabble May 24, 2024

I believe that INT96 is deprecated (see https://stackoverflow.com/questions/55829202/unable-to-read-date-format-columns-int96-type-from-avro-parquet-schema-in-apac). Did you encounter any INT96 fields in the GeoParquet files?

Contributor

We can probably discuss organising these files further into sub-directories. FileInfo and DoubleValue don't really belong together.

return configuration;
}

private static URI getRootUri(URI uri) throws URISyntaxException {
Contributor

Is this a standard way to do it?

Member Author

No, this is a quick and dirty attempt at supporting wildcards. I will put a TODO ;)
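
For readers wondering what deriving a root URI from a wildcard path might look like, here is a minimal sketch. It is not the getRootUri implementation from this PR, whose body is not shown here. The class `GlobRootSketch`, the method `rootOf`, and the example bucket URI are all hypothetical. The sketch assumes that the root is the deepest directory before the first glob character.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical sketch: not the getRootUri method from the PR.
final class GlobRootSketch {

  // Returns the URI of the deepest directory that precedes the first glob
  // character in the path, or the URI itself if the path has no glob character.
  static URI rootOf(URI uri) {
    String path = uri.getPath();
    if (path == null) {
      return uri; // opaque URIs have no path to inspect
    }
    int wildcard = firstWildcard(path);
    if (wildcard < 0) {
      return uri;
    }
    int slash = path.lastIndexOf('/', wildcard);
    String rootPath = slash < 0 ? "/" : path.substring(0, slash + 1);
    try {
      return new URI(uri.getScheme(), uri.getAuthority(), rootPath, null, null);
    } catch (URISyntaxException e) {
      throw new IllegalArgumentException("Cannot derive a root from " + uri, e);
    }
  }

  // Index of the first character commonly used in glob patterns, or -1.
  private static int firstWildcard(String path) {
    for (int i = 0; i < path.length(); i++) {
      if ("*?[{".indexOf(path.charAt(i)) >= 0) {
        return i;
      }
    }
    return -1;
  }

  public static void main(String[] args) {
    URI glob = URI.create("s3a://bucket/release/theme=*/type=*/*.parquet");
    System.out.println(rootOf(glob));
  }
}
```

The resulting root could then be used to obtain a file system handle, while the full pattern is passed to a glob call such as the `fileSystem.globStatus(globPath)` seen elsewhere in this PR.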

Comment on lines +80 to +81
super(file, writeSupport, compressionCodecName, blockSize, pageSize,
pageSize, enableDictionary, enableValidation, writerVersion, conf);

Check notice

Code scanning / CodeQL

Deprecated method or constructor invocation Note

Invoking ParquetWriter.ParquetWriter should be avoided because it has been deprecated.
@Drabble
Contributor

Drabble commented May 25, 2024

I did a little bit of reading on the spec of geoparquet. https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md

I have a feeling that what our library does is extract the data from a generic Parquet file and, in addition, provide the GeoParquet metadata. We don't really do any GeoParquet-specific processing apart from the metadata parsing.

We could provide extra functions to parse the WKB/GeoArrow binary into a Java class that loads up geometries and sets the correct CRS. For Baremaps, I don't think this is really useful as we store raw WKB in the database. But it could be useful for someone using the library. Maybe this should be considered in a second iteration?
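
As an illustration of the kind of helper discussed here, the following sketch decodes a single 2D WKB point using only the JDK. It is not part of this PR, and the names `WkbPointSketch`, `Point`, `decodePoint` and `encodePoint` are hypothetical. A real implementation would more likely delegate to an existing geometry library and would take the CRS from the column metadata, which this sketch does not attempt.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch: decodes only 2D WKB points and ignores the CRS.
final class WkbPointSketch {

  record Point(double x, double y) {}

  static Point decodePoint(byte[] wkb) {
    ByteBuffer buffer = ByteBuffer.wrap(wkb);
    // First byte is the byte order flag: 0 = big endian, 1 = little endian.
    buffer.order(buffer.get() == 1 ? ByteOrder.LITTLE_ENDIAN : ByteOrder.BIG_ENDIAN);
    // Next four bytes are the geometry type code; 1 means Point.
    int type = buffer.getInt();
    if (type != 1) {
      throw new IllegalArgumentException("Not a 2D WKB point, type code: " + type);
    }
    // The x and y coordinates follow as two doubles.
    return new Point(buffer.getDouble(), buffer.getDouble());
  }

  // Builds a little-endian WKB point, used here only to produce sample input.
  static byte[] encodePoint(double x, double y) {
    return ByteBuffer.allocate(21)
        .order(ByteOrder.LITTLE_ENDIAN)
        .put((byte) 1)
        .putInt(1)
        .putDouble(x)
        .putDouble(y)
        .array();
  }

  public static void main(String[] args) {
    System.out.println(decodePoint(encodePoint(6.63, 46.52)));
  }
}
```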

The big question is: should we return a logical type called Geometry instead of the binary type? The same could apply to other logical types such as STRING or Date. And should we rename the Primitive class to something else, since String and Geometry are not primitives?

Regarding the geoparquet metadata, I believe we should provide this metadata with a function like geoParquetReader.getGeoParquetMetadata() instead of including it inside each GeoParquetGroupImpl object. The Geoparquet metadata should be the same for each record, as long as the files are valid.

As for writing to GeoParquet files, I think this is not useful for Baremaps. We should maybe consider it as a second or third step. I think there are a lot of questions about supporting the entire specification and validating schemas if we try to implement that.

Some things to consider for support are:

  • Single-geometry type encodings based on the GeoArrow specification instead of WKB.
  • Validating input GeoParquet files. For example, according to the spec, geometries should never appear inside nested objects.
  • Using more complex types for GeoParquetColumnMetadata. For example, the edges field could be an enum. The spec describes it as: "Name of the coordinate system for the edges. Must be one of "planar" or "spherical". The default value is "planar"."
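
The enum suggestion for the edges field could look roughly like the sketch below. The enum name `Edges` and the method `fromMetadata` are hypothetical and do not come from this PR. The two allowed values and the "planar" default are taken from the spec text quoted above.

```java
// Hypothetical sketch: an enum for the "edges" field of the column metadata.
enum Edges {
  PLANAR,
  SPHERICAL;

  // Maps the raw metadata string to an enum constant.
  // A missing value falls back to PLANAR, the default named in the spec text.
  static Edges fromMetadata(String value) {
    if (value == null) {
      return PLANAR;
    }
    return switch (value) {
      case "planar" -> PLANAR;
      case "spherical" -> SPHERICAL;
      default -> throw new IllegalArgumentException("Unknown edges value: " + value);
    };
  }
}
```

Rejecting unknown strings at parse time also gives a first, small piece of input validation, which relates to the validation point in the list above.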

@bchapuis bchapuis force-pushed the 849-geoparquet branch 3 times, most recently from 7f321ee to 455dccc Compare May 27, 2024 19:25
import org.apache.baremaps.testing.TestFiles;
import org.junit.jupiter.api.Test;

class GeoParquetDataSchemaTest {

Check notice

Code scanning / CodeQL

Unused classes and interfaces Note test

Unused class: GeoParquetDataSchemaTest is not referenced within this codebase. If not used as an external API it should be removed.

// Iterate over all the files in the path
for (FileStatus file : fileSystem.globStatus(globPath)) {
ParquetFileReader reader = ParquetFileReader.open(configuration, file.getPath());

Check notice

Code scanning / CodeQL

Deprecated method or constructor invocation Note

Invoking ParquetFileReader.open should be avoided because it has been deprecated.
}

public MessageType getParquetSchema() throws IOException, URISyntaxException {
return files().values().stream()
Contributor

@Drabble Drabble May 29, 2024

Do you think we will have a different Geoparquet DataTable for each subtheme?

I believe each subtheme will have a different schema: https://docs.overturemaps.org/schema/reference/

For example, water inside the base theme has an is_salt property and land has an elevation property.

That would make a total of 16 tables with the current Overture Maps spec.

for (int i = 0; i < fields.size(); i++) {
Field field = fields.get(i);
field.type();
switch (field.type()) {

Check warning

Code scanning / CodeQL

Missing enum case in switch Warning

Switch statement does not have a case for LONG.
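
One way to avoid this class of warning is to let the compiler enforce exhaustiveness. The sketch below is an assumption-laden illustration, not the code from this PR: the enum `FieldType`, its constants, and the method `javaTypeOf` are hypothetical. It relies on the fact that a Java switch expression over an enum, written without a default branch, must cover every constant or the code does not compile.

```java
// Hypothetical sketch: the enum constants below are invented for illustration.
final class SwitchSketch {

  enum FieldType { BINARY, BOOLEAN, DOUBLE, FLOAT, INTEGER, LONG, STRING }

  // A switch expression without a default branch must cover every constant.
  // Removing the LONG case would turn the missing case into a compile error.
  static Class<?> javaTypeOf(FieldType type) {
    return switch (type) {
      case BINARY -> byte[].class;
      case BOOLEAN -> Boolean.class;
      case DOUBLE -> Double.class;
      case FLOAT -> Float.class;
      case INTEGER -> Integer.class;
      case LONG -> Long.class;
      case STRING -> String.class;
    };
  }

  public static void main(String[] args) {
    System.out.println(javaTypeOf(FieldType.LONG));
  }
}
```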
@bchapuis bchapuis force-pushed the 849-geoparquet branch 2 times, most recently from 02ef525 to e05898f Compare May 30, 2024 10:11
@bchapuis bchapuis force-pushed the 849-geoparquet branch 2 times, most recently from 0f33525 to 8eb6763 Compare May 30, 2024 11:46

Quality Gate failed

Failed conditions
E Reliability Rating on New Code (required ≥ A)

@bchapuis bchapuis merged commit a838fd0 into main May 30, 2024
8 of 9 checks passed