BenchmarkDataNLP.jl is a Julia package for quickly generating synthetic, complexity-controlled NLP benchmark datasets, without having to search for large corpora or clean up messy CSVs. GitHub repo: https://github.com/mantzaris/BenchmarkDataNLP.jl
Key aspect: the complexity of the production is controllable. The vocabulary size, the number of grammar productions, the number of punctuation symbols, and more are all linearly proportional to a complexity parameter that ranges from 1 to 100. At complexity 100 the vocabulary is 10K words (at complexity 1 there are 5 letters and 10 words). Different generators can be used: context-free grammars (CFGs), template strings, RDF triples, or even a finite-state-machine production approach. You choose how many sentences to produce, and training/testing/validation datasets are written automatically. Aspects of the grammar such as polysemy can be enabled or disabled.
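As a rough sketch of that linear scaling (the exact mapping is internal to the package; this interpolation is only an assumption based on the two endpoints stated above):

# hypothetical interpolation between the stated endpoints:
# complexity 1 -> 10 words, complexity 100 -> 10_000 words
function approx_vocab_size(complexity::Integer)
    @assert 1 <= complexity <= 100
    round(Int, 10 + (complexity - 1) * (10_000 - 10) / 99)
end

approx_vocab_size(1)    # 10
approx_vocab_size(100)  # 10000
approx_vocab_size(20)   # 1927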
ex:
# generate a dataset using a context-free grammar at complexity 20, with 1,000 sentences
# (800 training lines, 100 testing, 100 validation), written to a directory of your choice, e.g. "/home/user/Documents"
generate_corpus_CFG(complexity = 20,
num_sentences = 1_000,
enable_polysemy = false,
output_dir = "/home/user/Documents",
base_filename = "MyDataset")
This writes three .jsonl files:
MyDataset_train.jsonl # 800 lines
MyDataset_test.jsonl # 100 lines
MyDataset_valid.jsonl # 100 lines
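To load the splits back in, any JSONL reader works; a minimal sketch assuming JSON3.jl (not a dependency of the package, just one convenient choice):

using JSON3

# read one split into a vector of sentence strings
read_split(path) = [String(JSON3.read(line).text) for line in eachline(path)]

train = read_split("/home/user/Documents/MyDataset_train.jsonl")
length(train)  # 800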
Inside, each line is a JSON object with a single "text" field, for example:
{"text":"가가가 가 가 가 가가가."}
Each line is UTF-8 text, drawn by default from the Hangul Syllables block to avoid accidental real-word bias.
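If you want to sanity-check that default, a quick sketch (assuming the characters come from the Hangul Syllables block, U+AC00 to U+D7A3, and reusing read_split from above):

# every non-space, non-punctuation character should fall in U+AC00..U+D7A3
in_hangul(c::Char) = '\uAC00' <= c <= '\uD7A3'
sentence = first(read_split("/home/user/Documents/MyDataset_valid.jsonl"))
all(c -> in_hangul(c) || isspace(c) || ispunct(c), sentence)  # true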