BenchmarkDataNLP.jl

BenchmarkDataNLP.jl is a Julia package for quickly generating synthetic, complexity-controlled NLP benchmark datasets, with no need to hunt down large corpora or clean messy CSVs. GitHub repo → https://github.com/mantzaris/BenchmarkDataNLP.jl

Key aspect: the complexity of the generated text is controllable. Vocabulary size, the number of grammar productions, the number of punctuation symbols, and more all scale linearly with a complexity parameter ranging from 1 to 100. At complexity 100 the vocabulary is 10,000 words; at complexity 1 it is 10 words built from 5 letters. Several generators are available: context-free grammars, template strings, RDF triples, and a finite-state-machine production approach. The user chooses how many sentences to produce, and training/testing/validation splits are written automatically. Grammar features such as polysemy can be enabled or disabled.
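To make the linear scaling concrete, here is an illustrative sketch that interpolates between the two documented endpoints (complexity 1 → 10 words, complexity 100 → 10,000 words). The function name and the exact interpolation are my assumptions, not the package's internal mapping:

```julia
# Hypothetical linear interpolation between the documented endpoints:
# complexity 1 -> 10 words, complexity 100 -> 10,000 words.
# The package's actual internal mapping may differ.
vocab_size(complexity::Integer) =
    round(Int, 10 + (complexity - 1) * (10_000 - 10) / 99)

vocab_size(1)    # 10
vocab_size(100)  # 10000
```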

Example:

# Generate a dataset using a context-free grammar at complexity 20,
# with 1,000 sentences (800 lines in training, 100 testing, 100 validation),
# written to a directory of your choice, e.g. "/home/user/Documents"
generate_corpus_CFG(complexity = 20,
                    num_sentences = 1_000,
                    enable_polysemy = false,
                    output_dir = "/home/user/Documents",
                    base_filename = "MyDataset")

This writes three .jsonl files:
MyDataset_train.jsonl # 800 lines
MyDataset_test.jsonl # 100 lines
MyDataset_valid.jsonl # 100 lines

Inside, each line looks like:

{"text":"갃갇갊 갆 갇 갆 갃가갇."}

Each line is UTF-8 text; by default the tokens are drawn from the Hangul block to avoid accidental real-word bias.
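Reading the splits back is straightforward. Here is a minimal stdlib-only sketch; the regex is a naive shortcut that assumes the simple `{"text":"…"}` lines shown above (a real JSON parser such as JSON3.jl is safer for anything more complex):

```julia
# Minimal sketch: collect the "text" field from each line of a .jsonl split.
# NOTE: the regex is not a JSON parser; it assumes the value contains no
# escaped quotes, which holds for the simple lines shown above.
function read_texts(path::AbstractString)
    texts = String[]
    for line in eachline(path)
        m = match(r"\"text\":\"(.*)\"", line)
        m === nothing || push!(texts, String(m.captures[1]))
    end
    return texts
end

# Hypothetical usage with the files generated above:
# train = read_texts("MyDataset_train.jsonl")   # 800 entries
```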
