tokeniser-py

A custom tokeniser with a 131,072-token vocabulary derived from 0.5B (val) and 1B (val+test) tokens in SlimPajama. Uses a novel token generation algorithm and a dynamic programming-based segmentation…
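The dynamic programming-based segmentation mentioned above can be sketched in miniature. This is an illustrative toy, not tokeniser-py's actual API: the `segment` function, its signature, and the tiny vocabulary are all assumptions; the real package works with a 131,072-token vocabulary and its own algorithm.

```python
def segment(text, vocab):
    """Split `text` into the fewest vocabulary tokens via dynamic programming.

    Toy illustration of DP segmentation; not tokeniser-py's real interface.
    """
    n = len(text)
    # best[i] holds (token_count, tokens) for the optimal split of text[:i]
    best = [None] * (n + 1)
    best[0] = (0, [])
    for i in range(1, n + 1):
        for j in range(i):
            piece = text[j:i]
            if piece in vocab and best[j] is not None:
                cand = (best[j][0] + 1, best[j][1] + [piece])
                if best[i] is None or cand[0] < best[i][0]:
                    best[i] = cand
    if best[n] is None:
        raise ValueError("text cannot be segmented with this vocabulary")
    return best[n][1]

# "token" + "iser" (2 tokens) beats "tok" + "en" + "iser" (3 tokens)
print(segment("tokeniser", {"token", "iser", "tok", "en"}))
# → ['token', 'iser']
```

Minimising token count is one common DP objective; a real subword tokeniser might instead maximise the sum of token log-probabilities, but the table-filling structure is the same.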

Installation

In a virtualenv (see these instructions if you need to create one):

pip3 install tokeniser-py

Releases

Version | Released   | Bullseye (Python 3.9) | Bookworm (Python 3.11) | Files
0.1.2   | 2025-03-22 |                       |                        |
0.1.1   | 2025-03-22 |                       |                        |
0.1.0   | 2025-03-22 |                       |                        |

Page last updated 2025-03-22 20:27:18 UTC