BenchmarkDataNLP.jl is a Julia package for quickly generating synthetic, complexity-controlled NLP benchmark datasets, without having to search for large corpora or clean up messy CSVs. GitHub repo: https://github.com/mantzaris/BenchmarkDataNLP.jl
Key aspect: the complexity of the production is controllable. The vocabulary size, the number of grammar productions, the number of punctuation symbols, and more are all linearly proportional to a complexity parameter that ranges from 1 to 100. At complexity 100 the vocabulary is 10K words (at complexity 1 there are 5 letters and 10 words). Different generators can be used: context-free grammars (CFGs), template strings, RDF triples, or even a finite-state-machine production approach. You choose how many sentences to produce, and training/testing/validation datasets are written automatically. Aspects of the grammar such as polysemy can be enabled or disabled.
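As a rough sketch of that linear scaling (the exact mapping is internal to the package; this interpolation is only an assumption based on the two endpoints stated above):

# hypothetical interpolation between the stated endpoints:
# complexity 1 -> 10 words, complexity 100 -> 10_000 words
function approx_vocab_size(complexity::Integer)
    @assert 1 <= complexity <= 100
    round(Int, 10 + (complexity - 1) * (10_000 - 10) / 99)
end

approx_vocab_size(1)    # 10
approx_vocab_size(100)  # 10000
approx_vocab_size(20)   # 1927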
ex:
# generate a dataset using a context-free grammar at complexity 20, with 1,000 sentences
# (800 training lines, 100 testing, 100 validation), written to a directory of your choice, e.g. "/home/user/Documents"
generate_corpus_CFG(complexity = 20,
num_sentences = 1_000,
enable_polysemy = false,
output_dir = "/home/user/Documents",
base_filename = "MyDataset")
This writes three .jsonl files:
MyDataset_train.jsonl # 800 lines
MyDataset_test.jsonl # 100 lines
MyDataset_valid.jsonl # 100 lines
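To load the splits back in, any JSONL reader works; a minimal sketch assuming JSON3.jl (not a dependency of the package, just one convenient choice):

using JSON3

# read one split into a vector of sentence strings
read_split(path) = [String(JSON3.read(line).text) for line in eachline(path)]

train = read_split("/home/user/Documents/MyDataset_train.jsonl")
length(train)  # 800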
Inside, each line is a JSON object with a single "text" field, for example:
{"text":"가가가 가 가 가 가가가."}
Each line is UTF-8 text, drawn by default from the Hangul Syllables block to avoid accidental real-word bias.
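If you want to sanity-check that default, a quick sketch (assuming the characters come from the Hangul Syllables block, U+AC00 to U+D7A3, and reusing read_split from above):

# every non-space, non-punctuation character should fall in U+AC00..U+D7A3
in_hangul(c::Char) = '\uAC00' <= c <= '\uD7A3'
sentence = first(read_split("/home/user/Documents/MyDataset_valid.jsonl"))
all(c -> in_hangul(c) || isspace(c) || ispunct(c), sentence)  # true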