If possible, convert those files to compressed Parquet, and apply sorting and partitioning to them.
I've gotten 10-100 GB CSV files down to 300 MB-5 GB just by doing that.
That makes searching and scanning so much faster, and you can do it all with free, open-source software like polars and ibis.
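Roughly something like this with polars (just a sketch; the file and column names "big_export.csv", "region", and "event_time" are placeholders for whatever your data has):

```python
import polars as pl
from pathlib import Path

# Lazy scan, so the CSV never has to fit in RAM.
lf = pl.scan_csv("big_export.csv")

out_dir = Path("events_parquet")
out_dir.mkdir(exist_ok=True)

# Write one sorted, zstd-compressed parquet file per partition key.
keys = lf.select(pl.col("region").unique()).collect()["region"].to_list()
for key in keys:
    (
        lf.filter(pl.col("region") == key)
          .sort("event_time")               # sorted row groups prune better on later scans
          .collect(streaming=True)          # streaming keeps memory bounded
          .write_parquet(out_dir / f"region={key}.parquet", compression="zstd")
    )
```

After that, `pl.scan_parquet("events_parquet/*.parquet")` with a filter on the partition or sort column only has to touch the files and row groups it actually needs.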
That is cool! I've wanted to use a model like this but haven't really looked into it.
Are you self hosting the long context llm, or what are you using?
Context lengths are what kill a lot of my local llm experiments.