Business
Life
Technology
AboutProblem statement
- We want to build a system that allows computing a large number of expensive functions on rows in a dataset
- We want to do this in a distributed fashion
- The size of the dataset is larger than memory
- the functions are expensive to compute and may either be executed through an API call or against a local GPU
- we want to cache the results in a storage system but ideally avoid long-running DBs to be running
- the system may be shared by multiple teams to benefit from economies of scale
Solution approach
- the distribution of compute will be done via ray as it allows us to run large scale of huggingface models
- the caching layer is an open question
- the reading will happen with kedro and polars datasets in lazy mode
- polars will read the data in batches which will be passed to ray for processing
- we believe that we should determine what needs calculating before passing to ray as we expect the % of the total data that needs computing to be very small in most
moments