Metrics for Text Generation from T5 Model

A user training a T5 model asked for alternative metrics to Exact Match for evaluating text generation. Community members suggested ROUGE-1, ROUGE-2, and BLEU, and recommended Braintrust for running evaluations on small test sets.

Hey guys, I was training a T5 model and noticed that one of the metrics used for evaluation is the Exact Match metric. Is there any other metric that I could possibly use for evaluating text generation from the T5 model? If yes, could you also point me toward resources that would help me implement such metrics?

Chrode https://discuss.huggingface.co/u/Chrode 2

Hey @Praneet, did you solve it? I am looking for the same approach. Thanks.

Praneet https://discuss.huggingface.co/u/Praneet 3

Sadly, I never really got around to it. I see many people just running against popular benchmarks, but that won’t work for my task. So I usually create a small test set with 30 to 50 samples that I can run my LLM over and evaluate manually. I heard that a few people behind some of the popular LLMs do something similar for smaller tasks that don’t have popular ways of evaluating them. @Chrode

4

Hey Praneet, Braintrust is a great tool for running those evaluations on the 30 to 50 samples. We provide a Python/TypeScript library to run and log those evals and give you a web UI to visualize improvements, regressions, etc. Use it for free @ https://braintrustdata.com/

avp2 https://discuss.huggingface.co/u/avp2 5

ROUGE-1, ROUGE-2, or BLEU also work.
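For anyone landing here later, below is a minimal sketch of the metrics mentioned in this thread (Exact Match, ROUGE-1/ROUGE-2, and BLEU) in plain Python. It is a simplified illustration, not the official implementation of any of them: it assumes lowercased whitespace tokenization, a single reference per prediction, ROUGE-N reported as F1, and BLEU without smoothing. The function names and the example sentences are made up for this sketch. For real evaluations, an established package such as `rouge-score` or `sacrebleu` is probably a safer choice.

```python
from collections import Counter
import math


def ngrams(tokens, n):
    """Return a Counter of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def exact_match(prediction, reference):
    """1.0 if the strings match after trimming and lowercasing, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())


def rouge_n(prediction, reference, n=1):
    """ROUGE-N F1 from clipped n-gram overlap (whitespace tokens, assumed)."""
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((pred & ref).values())  # Counter & keeps the minimum counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def bleu(prediction, reference, max_n=4):
    """Sentence-level BLEU with one reference and no smoothing."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        pred = ngrams(pred_tokens, n)
        ref = ngrams(ref_tokens, n)
        total = sum(pred.values())
        matched = sum((pred & ref).values())
        if total == 0 or matched == 0:
            # Without smoothing, one n-gram order with no match gives zero.
            return 0.0
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: penalize predictions shorter than the reference.
    c, r = len(pred_tokens), len(ref_tokens)
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


def average_scores(pairs):
    """Average each metric over a list of (prediction, reference) pairs."""
    metrics = {
        "exact_match": exact_match,
        "rouge1": lambda p, r: rouge_n(p, r, 1),
        "rouge2": lambda p, r: rouge_n(p, r, 2),
        "bleu": bleu,
    }
    return {
        name: sum(fn(p, r) for p, r in pairs) / len(pairs)
        for name, fn in metrics.items()
    }


# Made-up example pair, standing in for one row of a small test set.
prediction = "the cat sat on the mat"
reference = "the cat is on the mat"
print(rouge_n(prediction, reference, 1))  # 5 of 6 unigrams overlap, F1 = 5/6
print(rouge_n(prediction, reference, 2))  # 3 of 5 bigrams overlap, F1 = 0.6
print(average_scores([(prediction, reference)]))
```

One thing to watch with short outputs: unsmoothed BLEU drops to zero as soon as one n-gram order has no match, which is the case for 4-grams in the example pair above. On a 30 to 50 sample test set with short generations, ROUGE or a smoothed BLEU variant may give more usable numbers.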