I was reading other issues and walking through the code, and I realized that example.py doesn't load any checkpoint. So, as far as I understand, example.py (and whatever you get from the code so far) is just the pre-trained ViT, transformers, and so on, and those components were never retrained in the MMM setting proposed by the paper, right?
I ask because I only have a few days to do some testing, and from what I can tell, even if I implemented the image and text bi-encoder I would get the same performance the model already has, right?
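For context, this is roughly the kind of checkpoint loading I expected to find in example.py but didn't. It's only a minimal sketch assuming a PyTorch state dict; the model stand-in, the helper names, and the checkpoint path are placeholders, not anything from this repo:

```python
import torch
import torch.nn as nn

# Placeholder model -- stands in for however example.py builds the pre-trained
# ViT + text transformer; the repo's actual model class would go here instead.
model = nn.Sequential(nn.Linear(768, 768))

# Hypothetical checkpoint path; as far as I can tell, the repo never loads
# anything like this, which is exactly what my question above is about.
ckpt_path = "checkpoints/mmm_finetuned.pth"

checkpoint = torch.load(ckpt_path, map_location="cpu")
state_dict = checkpoint.get("state_dict", checkpoint)  # unwrap if a trainer saved it nested
missing, unexpected = model.load_state_dict(state_dict, strict=False)
print("missing keys:", missing)
print("unexpected keys:", unexpected)
```

If there is a released MMM-finetuned checkpoint that should be loaded this way, could you point me to it?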