Thanks for the nice work.
According to this code, it seems that the image token is also used for image_pos_token, which differs from the original paper.
If I understand the paper correctly, the camera extrinsic and intrinsic parameters are transformed into image_pos_token by passing them through an MLP layer. I would be grateful if you could point me to the code that computes image_pos_token.
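
To make the question concrete, here is a minimal sketch of how I read the paper's description. It is only my assumption of the idea, not code from this repository or from the paper: the function names, the flattening of a 4x4 extrinsic and a 3x3 intrinsic matrix into one 25-value vector, the layer sizes, the ReLU activation, and the random weights are all hypothetical.

```python
import random


def make_linear(n_in, n_out, rng):
    """Create a dense layer (weights, bias) with small random weights (untrained)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias


def linear(x, layer):
    """Apply y = W x + b to a plain list of floats."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def camera_to_pos_token(extrinsic, intrinsic, layers):
    """Flatten the camera matrices and map them to a token with a 2-layer MLP.

    Assumption: extrinsic is 4x4 and intrinsic is 3x3, concatenated to 25 values.
    No image features enter this function; only camera parameters do.
    """
    features = [v for row in extrinsic for v in row]
    features += [v for row in intrinsic for v in row]
    hidden = relu(linear(features, layers[0]))
    return linear(hidden, layers[1])


rng = random.Random(0)
token_dim = 8  # hypothetical token width, chosen only for the example
layers = [make_linear(25, 32, rng), make_linear(32, token_dim, rng)]

# Example camera: identity pose and a simple pinhole intrinsic matrix.
extrinsic = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
intrinsic = [[500.0, 0.0, 128.0], [0.0, 500.0, 128.0], [0.0, 0.0, 1.0]]

token = camera_to_pos_token(extrinsic, intrinsic, layers)
print(len(token))
```

In this reading, image_pos_token depends only on the camera parameters, which is why the image token appearing in its computation surprised me.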