Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models

Published in the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023)

Download paper here

Recommended citation: Orevaoghene Ahia, Sachin Kumar, Hila Gonen, Jungo Kasai, David R. Mortensen, Noah A. Smith, Yulia Tsvetkov. (2023). "Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models." 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP 2023).
