[ANN] Impostor.jl - A highly versatile synthetic data generator

Thanks for the question, @juliohm!

Impostor.jl differs from Faker.jl in several respects. I’ll break these differences down into client-side and backend so we can address them more clearly.

  • Client-Side:
    • Generator functions exported by Impostor.jl rely on Julia’s multiple dispatch to expand their functionality, adding options that control how data series are generated. Virtually all generator functions exported by Impostor follow this convention; this level of control is not available in Faker.jl. It is explained more thoroughly in the Concepts section of the docs.
    • The templating functionality provided by Faker.jl is improved in Impostor.jl by adding relations between columns, through the notion of hierarchical data. That is, data generated by the ImpostorTemplate is expected to respect certain properties found in real data. One example is the set of relations between countries, states and cities, in which some cities can only be associated with certain states and countries. The my_custom_template provided in the first post is another example (notice the interactions between the country_code, state and city columns). Impostor.jl was designed in such a way that these hierarchical restrictions can be incorporated into other providers in the future using the current backend organization.
    • The utility functions in Impostor.jl are much more condensed than in Faker.jl in terms of the number of methods to interact with. For example, Faker functions like lexify, numerify, bothify, random_lowercase_letter, random_uppercase_letter, random_digit and random_number are all condensed in Impostor.jl into the render_alphanumeric function, which renders string templates.
    • Although both Impostor.jl and Faker.jl support assigning locales to the current session, locales in Impostor.jl can also be specified on a per-call basis, and both the session and the individual methods have multi-locale support.
  • Backend:
    • Data in Impostor.jl is stored in partitioned .csv files and loaded on demand by each generator function, according to the needs of the current session. This decision comes from the fact that .csv files are faster to parse than the .yaml files in Faker.jl and potentially occupy less space on disk. In Impostor.jl, only the necessary data is loaded when a generator function is called. The sections Design and Structure and Archive Organization present further details on these topics.
    • Virtually all data manipulation takes place as DataFrame operations. This eases the implementation of new generator functions by other contributors, as it removes the boilerplate of relating the different datasets present in Impostor’s Data Archive.
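
To make the multiple dispatch and hierarchical data points a bit more concrete, here is a minimal toy sketch using only the standard library. It is not Impostor.jl's actual code or API: the names TOY_CITIES and toy_city, the tiny dataset and the three method signatures are all hypothetical, chosen only to show how one generator name can gain extra control options through dispatch while never pairing a city with the wrong state or country.

```julia
using Random

# Hypothetical toy archive (NOT Impostor.jl's data): each row carries its
# parent state and country, like a flat table with hierarchical columns.
const TOY_CITIES = [
    (country = "BRA", state = "SP", city = "Campinas"),
    (country = "BRA", state = "RJ", city = "Niterói"),
    (country = "USA", state = "CA", city = "Fresno"),
    (country = "USA", state = "TX", city = "Austin"),
]

# Method 1: draw n rows with no restriction.
toy_city(rng::AbstractRNG, n::Integer) = rand(rng, TOY_CITIES, n)

# Method 2: draw n rows restricted to the given country codes.
function toy_city(rng::AbstractRNG, n::Integer, countries::Vector{String})
    pool = filter(row -> row.country in countries, TOY_CITIES)
    isempty(pool) && throw(ArgumentError("no cities for $countries"))
    return rand(rng, pool, n)
end

# Method 3: one row per requested country, preserving the input order.
function toy_city(rng::AbstractRNG, countries::Vector{String})
    return [only(toy_city(rng, 1, [c])) for c in countries]
end
```

Because rows are sampled whole, the country, state and city values in each result always belong together, for example in `toy_city(MersenneTwister(1), 5, ["BRA"])`. This is only one possible way to enforce such a restriction; the docs describe how Impostor.jl itself does it.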

In summary, Impostor.jl is an improved version of Faker.jl, designed from the ground up to be maintainable, versatile and more efficient at generating tabular data. At the moment, the selection of providers is not as complete as Faker’s, but the plan is to vastly expand this collection in the future, with community input.
