Study how LM Evaluation Harness works and try to implement it #231

Open
ggerganov opened this issue Mar 17, 2023 · 5 comments
Labels
enhancement · generation quality · help wanted · high priority · research 🔬

Comments

@ggerganov
Member

ggerganov commented Mar 17, 2023

Update 10 Apr 2024: #231 (comment)


It would be great to start doing this kind of quantitative analysis of ggml-based inference:

https://bellard.org/ts_server/

It looks like Fabrice evaluates the models using something called LM Evaluation Harness:

https://github.com/EleutherAI/lm-evaluation-harness

I have no idea what this is yet, but it would be nice to study it and try to integrate it here and in other ggml-based projects.
This will be a very important step towards estimating the quality of the generated output and seeing if we are on the right track.

@ggerganov ggerganov added enhancement New feature or request high priority Very important issue generation quality Quality of model output labels Mar 17, 2023
@ggerganov ggerganov pinned this issue Mar 17, 2023
@anzz1
Contributor

anzz1 commented Mar 17, 2023

Half the fun in AI though is not completely understanding why the results are what they are.

I'm only (half) joking, though; this will obviously be a good thing. Pitting various models against each other in a common environment seems like the right way forward. This would not only help in training better models but also present more options varying in quality, speed and the amount of resources required to run them.

@ggerganov ggerganov mentioned this issue Mar 19, 2023
@gjmulder gjmulder unpinned this issue Mar 27, 2023
@Green-Sky
Collaborator

Green-Sky commented Apr 20, 2023

As far as I can tell, you just have to implement a Python class for the model.

e.g.:
https://github.com/EleutherAI/lm-evaluation-harness/blob/master/lm_eval/models/gpt2.py

edit: or here is the "model" API usage for Bellard's textsynth API.

edit 2: someone created an issue on their end: EleutherAI/lm-evaluation-harness#417

@StellaAthena

Hi! We are quite interested in supporting ggml, but nobody on our team has experience with Python bindings for C AFAIK.

Copying from the issue on our side,

The process would look something like:

  • Make a new file in lm_eval/models called “ggml_model.py” or similar
  • In that file, make a BaseLM subclass called GGMLLM or similar

This class should do the following:

  • In initialization, instantiate a model using the Python bindings
  • Implement the loglikelihood_rolling(), loglikelihood(), and greedy_until() class methods to support all 3 completion types (see gpt3.py or BaseLM for a template to compare to)
  • Add any helper methods for those functions!

We’d be happy to help however we can!
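Below is a minimal sketch of what such a subclass might look like, assuming llama-cpp-python for the bindings. The import path, class layout, and method names follow the description above; the exact signatures and the set of required overrides may differ between harness versions, so treat this as an illustration rather than a working implementation.

```python
# Hedged sketch only — a possible lm_eval/models/ggml_model.py, assuming the
# llama-cpp-python bindings. The import path and method names follow the
# description above; exact signatures may differ between harness versions.
from lm_eval.base import BaseLM


class GGMLLM(BaseLM):
    def __init__(self, model_path, n_ctx=2048):
        super().__init__()
        # Instantiate the model through the Python bindings
        # (assumption: llama_cpp.Llama with logits_all=True so per-token
        #  logits are available for scoring continuations).
        from llama_cpp import Llama

        self.model = Llama(model_path=model_path, n_ctx=n_ctx, logits_all=True)
        self._max_length = n_ctx

    @property
    def eot_token_id(self):
        return self.model.token_eos()

    @property
    def max_length(self):
        return self._max_length

    def tok_encode(self, string):
        return self.model.tokenize(string.encode("utf-8"))

    def tok_decode(self, tokens):
        return self.model.detokenize(tokens).decode("utf-8")

    def loglikelihood(self, requests):
        # For each (context, continuation) pair: score the continuation tokens
        # under the model and report whether greedy decoding reproduces them.
        raise NotImplementedError

    def loglikelihood_rolling(self, requests):
        # Full-document log-likelihood, computed over sliding max_length windows.
        raise NotImplementedError

    def greedy_until(self, requests):
        # Greedy generation until one of the requested stop sequences appears.
        raise NotImplementedError
```

The three stubbed methods mark where the actual calls into the bindings would go; the gpt2.py file linked above shows how the upstream model classes handle batching and tokenization.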

@github-actions github-actions bot added the stale label Mar 25, 2024
github-actions bot commented

This issue was closed because it has been inactive for 14 days since being marked as stale.

@StellaAthena

For the record, we successfully integrated this into the eval harness via llama-cpp-python. Currently it's llama.cpp-specific, and extending it to the entire ggml ecosystem would be awesome. Our real bottleneck is not being very familiar with using Python bindings (also manpower).
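For illustration, here is a hedged sketch of how that integration might be driven from Python, assuming the harness exposes a model type (called "gguf" here) backed by a running llama-cpp-python server; the model-type name, arguments, and result layout are assumptions and may differ between harness versions.

```python
# Hedged sketch only — driving the harness against a llama.cpp model.
# Assumptions: this harness version exposes a "gguf" model type backed by a
# llama-cpp-python server reachable at base_url; names and arguments may differ.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="gguf",                                 # assumed model-type name
    model_args="base_url=http://localhost:8000",  # assumed server endpoint
    tasks=["hellaswag"],
    batch_size=1,
)
print(results["results"])
```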

@ggerganov ggerganov reopened this Apr 10, 2024
@ggerganov ggerganov added help wanted Extra attention is needed and removed stale labels Apr 10, 2024