OrderedDict not needed, and question and comment #53

Open
PallHaraldsson opened this issue Dec 13, 2024 · 0 comments
PallHaraldsson commented Dec 13, 2024

A.
@vthorsteinsson I see you added OrderedDict (and OrderedSet) in late 2019, when Python 3.6 was still in use and the built-in dict was not yet guaranteed to preserve insertion order.

If you only support Python 3.7 and higher, it seems you could simplify the code; I'm not sure whether it would be faster, but hopefully so:

https://stackoverflow.com/questions/1653970/does-python-have-an-ordered-set

The answer is no, but as of Python 3.7 you can use the simple dict from the Python standard library with just keys (and values as None) for the same purpose.
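A minimal sketch of what I mean, assuming OrderedSet/OrderedDict are only used for de-duplication and insertion-order iteration (the names below are just for illustration, not from this repo):

```python
# Ordered-set behaviour with a plain dict (Python 3.7+): keys carry the
# elements, values are just None, and insertion order is preserved.
def ordered_unique(items):
    """Return the unique items of `items` in first-seen order."""
    return list(dict.fromkeys(items))


# A plain dict can likewise stand in for OrderedDict in most cases,
# since insertion order is guaranteed from Python 3.7 onward.
d = {}
d["b"] = 2
d["a"] = 1
assert list(d) == ["b", "a"]  # insertion order kept

print(ordered_unique(["b", "a", "b", "c", "a"]))  # -> ['b', 'a', 'c']
```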

946ffc7

https://docs.python.org/3/library/collections.html

Ordered dictionaries are just like regular dictionaries but have some extra capabilities relating to ordering operations. They have become less important now that the built-in dict class gained the ability to remember insertion order (this new behavior became guaranteed in Python 3.7).

[make sure to read the rest there.]

https://deepsource.com/blog/python-performance-three-easy-tips

When initializing a new dictionary, using {} is much more performant than calling the dict built-in.

https://stackoverflow.com/questions/18422995/why-is-ordereddict-10x-slower-than-dict-and-list
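If you want to check those claims on your own interpreter, here is a quick micro-benchmark sketch (numbers will of course vary by Python version and machine):

```python
import timeit

# The {} literal avoids the global name lookup and call overhead of dict().
print("{}            ", timeit.timeit("{}", number=1_000_000))
print("dict()        ", timeit.timeit("dict()", number=1_000_000))
print("OrderedDict() ", timeit.timeit(
    "OrderedDict()",
    setup="from collections import OrderedDict",
    number=1_000_000))
```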

B.
I was looking up tokenizers (BPE etc.) for LLMs and dropped in on your repo by accident. Such tokenizers were made first for English and mostly optimized for it; Icelandic and German are an afterthought at best, though Chinese has at least been worked on. I agree with Karpathy: I want tokenizers gone, at least in the long run. They are a solution, but also a problem for current LLMs. Do you do any work on such tokenizers/LLMs?
