[ANN] Impostor.jl - A highly versatile synthetic data generator

I’m glad to announce Impostor.jl, a highly versatile synthetic tabular data generator package written in Julia.

Designed with simplicity in mind and built upon Julia’s multiple-dispatch paradigm, Impostor exports several generator and utility functions through its API, providing the user with a wide variety of options to choose from when generating synthetic data. Another key feature is its ability to respect relations between columns when generating data via templates. Check out the documentation for more detailed information on this topic, as well as on concepts, conventions and how data generation is handled under the hood.
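
As a rough illustration of the dispatch convention described above, here is a minimal sketch in plain Julia. It is not Impostor's source: `toy_firstname` and `TOY_NAMES` are hypothetical names invented for this example, and the three method shapes (no arguments, a count, options plus a count) are only an assumption modeled on the call patterns shown in the usage examples.

```julia
# Hypothetical toy data; Impostor.jl ships its own data archive.
const TOY_NAMES = Dict(
    "F" => ["Mary", "Kate", "Milly"],
    "M" => ["Carl", "Paul", "Charles"],
)

# Method 1: a single value drawn from all available names.
toy_firstname() = rand(vcat(values(TOY_NAMES)...))

# Method 2: `n` values drawn from all available names.
toy_firstname(n::Integer) = [toy_firstname() for _ in 1:n]

# Method 3: `n` values restricted to the given option keys.
function toy_firstname(options::Vector{String}, n::Integer)
    pool = vcat((TOY_NAMES[o] for o in options)...)
    return rand(pool, n)
end
```

Calling `toy_firstname(["M"], 4)` then selects the third method purely from the argument types, so each new level of control is just another method on the same function name.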

A couple of usage examples are presented below:

using Impostor
using DataFrames


credit_card_number(; formatted = true)
# "4767-6731-1326-5309"

surname(4; locale = ["pt_BR"])
# 4-element Vector{String}:
#  "Feranndes"
#  "Pereira"
#  "Camargo"
#  "Pereira"

firstname(["M"], 4)
# 4-element Vector{String}:
#  "Charles"
#  "Zacharias"
#  "Paul"
#  "Charles"

city(["BRA", "USA"], 4; level=:country_code)
# 4-element Vector{String}:
#  "Curitiba"
#  "Los Angeles"
#  "São Paulo"
#  "Rio de Janeiro"

address(["BRA", "USA", "BRA", "USA"]; level = :country_code)
# 4-element Vector{String}:
#  "Avenida Paulo Lombardi 1834, Ba" ⋯ 25 bytes ⋯ "84-514, Porto Alegre-RS, Brasil"
#  "Abgail Smith Alley, Los Angeles" ⋯ 42 bytes ⋯ "ornia, United States of America"
#  "Avenida Tomas Lins 4324, (Apto " ⋯ 23 bytes ⋯ "orocaba - 89457-346, SP, Brasil"
#  "South-side Street 1st Floor, Li" ⋯ 52 bytes ⋯ "as-AR, United States of America"


my_custom_template = ImpostorTemplate([:firstname, :surname, :country_code, :state, :city]);

my_custom_template(4, DataFrame; locale = ["pt_BR", "en_US"])
# 4×5 DataFrame
#  Row │ firstname  surname   country_code  state           city
#      │ String     String    String3       String15        String15
# ─────┼───────────────────────────────────────────────────────────────────
#    1 │ Mary       Collins   BRA           Rio de Janeiro  Rio de Janeiro
#    2 │ Kate       Cornell   USA           Illinois        Chicago
#    3 │ Carl       Fraser    BRA           Paraná          Curitiba
#    4 │ Milly      da Silva  USA           California      Los Angeles


template_string = "I know firstname surname, this person is a(n) occupation";

render_template(template_string)
# "I know Charles Jameson, this person is a(n) Mathematician"

println("My new car plate is $(render_alphanumeric("^^^-####"))")
# My new car plate is TXP-9236
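
To make the idea behind the character template in the last example concrete, here is a minimal sketch in plain Julia. It is not how `render_alphanumeric` is implemented: `toy_render` is a hypothetical stand-in, and the placeholder meanings (`^` for an uppercase letter, `#` for a digit) are an assumption inferred from the `"^^^-####"` example and its output above.

```julia
# Hypothetical stand-in for a character-template renderer.
# Assumed placeholders, inferred from the car plate example:
#   '^' -> random uppercase letter, '#' -> random digit.
# Every other character is copied through unchanged.
function toy_render(template::AbstractString)
    return map(template) do c
        c == '^' ? rand('A':'Z') :
        c == '#' ? rand('0':'9') :
        c
    end
end

plate = toy_render("^^^-####")
```

Because the template is processed one character at a time, literal separators such as `-` survive untouched and the output always has the same length as the template.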

Currently Impostor.jl is in an MVP state, so to speak; while many features are planned for the near future, I’d like to get feedback from the community first in order to understand where efforts should be focused. Should you have any questions, suggestions or general feedback, feel free to open an issue in the repository or reply to this post with your thoughts.

Thanks!
Enzo


Can you please provide a comparison with Faker.jl?

There are probably other packages with similar features.


Thanks for the question, @juliohm!

Impostor.jl differs from Faker.jl in various aspects. I’ll break these differences down into client-side and backend so we can better address them.

  • Client-Side:
    • Generator functions exported by Impostor.jl rely on Julia’s multiple dispatch to expand their functionality, adding more options to control how data series are generated. This convention is established across the API and virtually all generator functions exported by Impostor adhere to it; this level of control is not available in Faker.jl. This is explained more thoroughly in the Concepts section of the docs.
    • Impostor.jl improves on the templating functionality provided by Faker.jl by adding relations between columns, in the form of hierarchical data. That is, data generated by an ImpostorTemplate is expected to respect certain properties found in real data; one example is the set of relations between countries, states and cities, in which a given city can only be associated with certain states and countries. The my_custom_template in the first post is another example of this (notice the interactions between the country_code, state and city columns). Impostor.jl was designed in such a way that these hierarchical restrictions can be incorporated into other providers in the future using the current backend organization.
    • The utility functions in Impostor.jl are much more condensed than in Faker.jl in terms of the number of methods to interact with. For example, the functionality of Faker functions like lexify, numerify, bothify, random_lowercase_letter, random_uppercase_letter, random_digit and random_number is all centered around the render_alphanumeric function, which renders string templates.
    • Although both Impostor.jl and Faker.jl support assigning locales to the current session, locales in Impostor.jl can also be specified on a per-call basis, and both the session and the individual methods support multiple locales.
  • Backend:
    • Data in Impostor.jl is stored in partitioned .csv files and loaded on demand by each generator function, according to the needs of the current session. This decision comes from the fact that .csv files are faster to parse than the .yaml files in Faker.jl and potentially occupy less disk space. In Impostor.jl, only the necessary data is loaded when a generator function is called; the sections Design and Structure and Archive Organization present further details on these topics.
    • Virtually all data manipulation takes place as DataFrame operations. This eases the implementation of new generator functions by other contributors, as it removes boilerplate when relating the different datasets present in Impostor’s Data Archive.
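
The hierarchical restriction and the on-demand loading described in the list above can be sketched together in a few lines of plain Julia. This is only an illustration under stated assumptions, not Impostor's backend: `TOY_ARCHIVE`, `LOADED`, `load_partition` and `toy_locations` are hypothetical names, the "archive" is an in-memory Dict standing in for the partitioned .csv files, and named tuples are used instead of DataFrames to keep the example dependency-free.

```julia
# Hypothetical in-memory "archive": one partition per country code.
# In the real package this data lives in partitioned .csv files.
const TOY_ARCHIVE = Dict(
    "BRA" => [(state = "Paraná", city = "Curitiba"),
              (state = "São Paulo", city = "Sorocaba")],
    "USA" => [(state = "Illinois", city = "Chicago"),
              (state = "California", city = "Los Angeles")],
)

# Cache of the partitions already requested in this session.
const LOADED = empty(TOY_ARCHIVE)

# Fetch a partition only on first use; later calls hit the cache.
load_partition(code::String) = get!(() -> TOY_ARCHIVE[code], LOADED, code)

# Draw consistent (country_code, state, city) rows: pick a country,
# then pick a whole row from that country's partition, so a city is
# never paired with a state or country it does not belong to.
function toy_locations(codes::Vector{String}, n::Integer)
    return map(1:n) do _
        code = rand(codes)
        row = rand(load_partition(code))
        (country_code = code, state = row.state, city = row.city)
    end
end

rows = toy_locations(["BRA"], 5)
```

Sampling whole rows rather than independent columns is what keeps the columns coherent, and after the call above only the "BRA" partition has been placed in the cache.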

In summary, Impostor.jl is an improved version of Faker.jl, designed from the ground up to be maintainable, versatile and more efficient when it comes to generating tabular data. At the moment, the selection of providers is not as complete as Faker’s, but the plan is to vastly expand this collection in the future with community input.


Could you clarify what you mean by “synthetic data” here? There are several different uses of the term in ML/statistics. Do you mean:

  1. Example data for testing machine learning algorithms?
  2. Data from a model designed to mimic a real dataset, without having to disclose private information?
  3. Something else?

@ParadaCarleton the term should be clear already. And the examples in the readme show it well. The word synthetic in English just means “non-real”, “fake”, created to “mimic” real data, “fabricated”, any synonym should fit well.
