Currently the batch method fails with a validation error if any of the generated rows fails the schema validators. To make the package usable in a testing environment, it would be useful to be able to generate a dataframe of any size using a rejection sampling method. This method should store the random seeds of successful builds so that the same dataframe can be reproduced each time.
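As a rough illustration of the idea, independent of Polyfactory, the sketch below uses only the standard library. The names build_row, is_valid and find_valid_seeds are hypothetical stand-ins (for factory.build(), the schema validators and the seed search respectively) and are not part of any package.

```python
import random


def build_row(seed: int) -> dict:
    """Hypothetical stand-in for factory.build(): one row from a seeded RNG."""
    rng = random.Random(seed)
    return {"age": rng.randint(-50, 120), "score": rng.random()}


def is_valid(row: dict) -> bool:
    """Hypothetical stand-in for the schema validators."""
    return 0 <= row["age"] <= 100


def find_valid_seeds(size: int) -> list[int]:
    """Try seeds 1, 2, 3, ... and keep those whose row passes validation."""
    seeds: list[int] = []
    seed = 1
    while len(seeds) < size:
        if is_valid(build_row(seed)):
            seeds.append(seed)
        seed += 1
    return seeds


# Replaying the stored seeds rebuilds exactly the same rows.
seeds = find_valid_seeds(5)
first = [build_row(s) for s in seeds]
second = [build_row(s) for s in seeds]
```

Because each row depends only on its seed, storing the accepted seeds is enough to regenerate the dataset without repeating the rejected attempts.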
I have created a class that performs these actions, included below. Since this is something I needed for my own project, it could be a useful feature for others who want to use Polyfactory for testing. I built it on top of the original pydantic-factories package, but I imagine it would work much the same way for the additional factory options in Polyfactory.
import json
import time

import pandas as pd
from polyfactory.factories.pydantic_factory import ModelFactory
from pydantic import ValidationError


class RejectionSampler:
    """Create a synthetic dataset based on the pydantic schema,
    dropping rows that do not pass the validation set up in the schema.

    Parameters
    ----------
    factory (type[ModelFactory]): ModelFactory subclass created from the pydantic schema
    size (int): Length of dataset to create
    """

    def __init__(self, factory: type[ModelFactory], size: int) -> None:
        self.factory = factory
        self.size = size
        self.used_seeds: list[int] = []

    def setup_seeds(self) -> None:
        """Find and store `size` seeds for which factory.build() passes validation."""
        start = time.time()
        self.used_seeds = []
        # start seed at 1, increase seed by 1 after each pass/fail of factory.build() to ensure reproducibility
        seed_no = 1
        while len(self.used_seeds) < self.size:
            try:
                self.factory.seed_random(seed_no)
                self.factory.build()
                self.used_seeds.append(seed_no)
            except ValidationError:
                pass  # rejected: this seed produced a row that fails the schema
            seed_no += 1
        end = time.time()
        print(f"finished, took {seed_no - 1} attempts to generate {self.size} rows")
        print(f"took {end - start} seconds to setup seeds")

    def generate(self) -> pd.DataFrame:
        """Rebuild the dataset from the stored seeds."""
        start = time.time()
        rows = []
        for seed in self.used_seeds:
            self.factory.seed_random(seed)
            result = self.factory.build()
            # .json() is the pydantic v1 serializer used in the original code
            rows.append(json.loads(result.json()))
        # DataFrame.append was removed in pandas 2.0, so build the frame once from all rows
        synthetic_data = pd.DataFrame(rows)
        end = time.time()
        print(f"took {end - start} seconds to generate new data")
        return synthetic_data
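One thing the loop in setup_seeds does not guard against is a schema that rejects every generated row, in which case it would never finish. A possible safeguard, sketched here with only the standard library, is to cap the number of attempts. The max_attempts parameter and the collect_seeds and try_build names are assumptions for illustration, not part of the class above or of Polyfactory.

```python
from typing import Callable


def collect_seeds(
    try_build: Callable[[int], bool], size: int, max_attempts: int
) -> list[int]:
    """Return the first `size` seeds for which try_build(seed) succeeds.

    Raises RuntimeError if fewer than `size` seeds succeed within
    max_attempts tries.
    """
    seeds: list[int] = []
    if size <= 0:
        return seeds
    for seed in range(1, max_attempts + 1):
        if try_build(seed):
            seeds.append(seed)
            if len(seeds) == size:
                return seeds
    raise RuntimeError(
        f"only {len(seeds)} of {size} rows passed validation "
        f"in {max_attempts} attempts"
    )


# Toy check: pretend only even seeds produce a valid row.
even_seeds = collect_seeds(lambda seed: seed % 2 == 0, size=3, max_attempts=10)
```

With the sampler class, try_build could wrap the seed_random and build calls in a try/except for ValidationError and return whether the build succeeded.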