Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate to phf, add no-std support #9

Merged
merged 6 commits into from
Jan 5, 2023
Merged

Conversation

mkroening
Copy link
Contributor

@mkroening mkroening commented Dec 22, 2022

Closes #7.

This PR is structured as follows:

  1. Migrate code generation from generate_table.py to build.rs.

    Before, generate_table.py was called once and the generated code was then committed to the repository.
    Now, build.rs generates the code from assets/code_tables.json on the fly for every user before the compilation. The resulting code is not saved persistently anywhere, and is regenerated after cargo clean.

    This step does not change the generated code or any dependencies.

  2. Migrate to phf.

    This change migrates the code generation for hash maps from the manual lazy_static and ahash::AHashMap based code to a phf_codegen and phf::Map based one.

This commit and the following ones are for supporting no-std environments:

  1. Enable Rust 2021 to enable Cargo resolver version 2.

    Without this, phf_codegen in build-dependencies enables the std
    feature for phf_shared in the normal dependencies scope as well.

    See https://doc.rust-lang.org/cargo/reference/features.html#feature-resolver-version-2

The next two commits are for refactoring all functionality that requires an allocator into one module for easier gating.

  1. Move string functions into string module.

  2. Move TableType string methods to string module.

  3. Add no_std support.

    This makes all functionality depend on the alloc crate instead of std, making it suitable for no-std environments.
    Additionally, the alloc functionality can be disabled via Cargo features for no-std environments without alloc.

@tats-u
Copy link
Owner

tats-u commented Dec 24, 2022

Are you trying to rewrite generate_table.py in Rust? If the build.rs does just what generate_table.py, I think you should rename build.rs as generate_tables.rs for example. build is too generic.

@mkroening mkroening force-pushed the phf branch 2 times, most recently from 161eb9e to 470c35c Compare December 25, 2022 10:56
@mkroening
Copy link
Contributor Author

Are you trying to rewrite generate_table.py in Rust? If the build.rs does just what generate_table.py, I think you should rename build.rs as generate_tables.rs for example. build is too generic.

build.rs is a Cargo build script. It has to be called build.rs. It's executed before compilation of the source files, and I used it to replace generate_table.py in the first commit, since I deemed it most suitable for migrating to phf in the second commit.

I rebased on latest master and applied your early suggestions. I pushed it commit by commit to have CI runs for each one of them.
I'll also update the description of this PR and mark this as ready for review.

Would you like me to split this PR up or are you okay with keeping it this big?

@mkroening mkroening changed the title Migrate to phf Migrate to phf, add no-std support Dec 25, 2022
@mkroening mkroening marked this pull request as ready for review December 25, 2022 11:21
@mkroening
Copy link
Contributor Author

If the build.rs does just what generate_table.py, I think you should rename build.rs as generate_tables.rs for example. build is too generic.

I just found, that you can actually use a different name for the build script: package.build. I have never seen another name than build.rs before though, and I would find things easier myself, if we keep the name build.rs for the build script.

It's up to you to decide, though. :)

build.rs Outdated
Comment on lines 15 to 23
struct CodeTables {
created: String,
tables: Vec<(u16, Table)>,
}
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want a doc comment that shows this is parsed content of code_tables.json. This is just an draft.

/// Parsed code tables from `code_tables.json`
struct CodeTables {
    /// ISO 8601 Timestamp when tables in the JSON are created
    created: String,
    //// code tables main part
    ////
    //// `([code page], [table definition])`
    tables: Vec<(u16, Table)>,
}

The name ParsedCodeTablesJsonContent might be better because this is not tables itself and has also a timestamp.

Also could you tell me why you think the Vec of the tuple is better than a map (Map, HashMap, or something similar)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I called it CodeTables, because the file is called code_tables.json. I also wanted to avoid a long name.

I chose Vec, because I want the tables to be sorted and stable in the generated output.

Copy link
Owner

@tats-u tats-u Jan 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mkroening Autocompletion by modern editors reduces a pain of typing long names a lot, though.

because I want the tables to be sorted and stable in the generated output.

Why not BTreeMap? Also you want to clarify a sort is needed in a comment.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was more concerned about reading, but if you want, I can rename it to ParsedCodeTablesJsonContent.

I chose Vec, because it is faster and simpler if we only sort it once and don't mutate it. BTreeMap would also work fine though and I can migrate to it if you want me to. :)

Copy link
Owner

@tats-u tats-u Jan 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mkroening I'm glad to hear your opinion. You don't have to modify.

I chose Vec, because it is faster

BTreeMap has the same order $O(n \log n)$ as a sort of Vec, but I didn't know the combination of Vec and sort is faster.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They should scale the same, and BTreeMap makes sure that the collection is kept sorted. Vec provides better cache locality, since its items are stored in a row. So it's about nice access patterns when iterating. For the BTreeMap, we'd jump around in memory a lot.

All of this is super irrelevant for our use case, since this is not at all performance sensitive, and benchmarks might prove me wrong in this case. So my choice was rather arbitrary and mainly out of habit. :)

@tats-u
Copy link
Owner

tats-u commented Dec 26, 2022

@mkroening I see, and you can go ahead (you can stick to build.rs) if you add sufficient comments.

Enable Rust 2021

We need the version 2 feature resolver for no_std support.
Otherwise, `phf_codegen` in `build-dependencies` enables the `std`
feature for `phf_shared` in the normal `dependencies` scope as well.
https://doc.rust-lang.org/cargo/reference/features.html#feature-resolver-version-2
@mkroening
Copy link
Contributor Author

Sorry for taking so long. I addressed your concerns. :)

@tats-u tats-u merged commit 47af156 into tats-u:master Jan 5, 2023
@mkroening mkroening deleted the phf branch January 5, 2023 15:55
tats-u added a commit that referenced this pull request Feb 20, 2023
- Use `phf` (perfect hash map) instead of `ahash` (#7, #9) by @mkroening
  - Actual type of `OEMCPHashMap` will be changed. I suppose that you do
    not have to modify your code; recompile might be needed (I do not
    know much)
- (Use `clippy` as lints (#8) by @mkroening)
- Remove unused dependency on `hfs_nfd`
@tats-u
Copy link
Owner

tats-u commented Feb 20, 2023

This change is now available in 1.2.0. Sorry to keep you waiting for a long time.

@mkroening
Copy link
Contributor Author

No worries! Thanks a lot! :)

tats-u added a commit that referenced this pull request Feb 21, 2023
- Revert changes (#9) that should have been 2.0.0 (#11)
- Update `ahash`
- Remove unused `hfs_nfd`

Note: #8 is included
tats-u added a commit that referenced this pull request Feb 21, 2023
- Use phf (perfect hash map) instead of ahash (#7, #9) by @mkroening
  Actual type of OEMCPHashMap will be changed.
@tats-u
Copy link
Owner

tats-u commented Feb 21, 2023

Sorry I'd like you to use 2.0.0(-beta.1) instead. I found this patch is a breaking change, contrary to my expectations.

tats-u added a commit that referenced this pull request Feb 27, 2023
- Use `once_cell` instead of `lazy_static` for test
- Comply with Clippy rules
- More small fixes

The following changes has been included since 2.0.0-beta.1.

- Use phf (perfect hash map) instead of ahash (#7, #9) by @mkroening
  Actual type of OEMCPHashMap will be changed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Adapt Rust-PHF
2 participants