I was thinking of trying this Zamba implementation. Reading the code, one loop looks odd. Is this some deliberate global weight-sharing scheme tied to the fractal shared-weight part? I don't understand it: it looks like it feeds the same input to the same layer over and over and drops every output except the last one. Or is it a typo? If it's a typo, the fix would need to be:
```python
out = x
for layer in self.layers:
    out = layer(out)
```
For reference, the forward method as it currently stands:

```python
def forward(self, x) -> Tensor:
    # Embed tokens
    x = self.embed(x)
    if self.post_embed_norm is not False:
        x = self.norm(x)
    for layer in self.layers:
        out = layer(x)  # every iteration uses the original x, not the previous layer's output
    # return OutputHead(self.dim, 1, self.vocab_size)(x)
    if self.output_head_on is not False:
        out = OutputHead(self.dim, 1, self.vocab_size)(x)
    else:
        return out
```
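For completeness, a minimal sketch of what the whole forward could look like with the layers actually chained. This assumes the intent is plain sequential layer application and that the output head should consume the final layer's output rather than the raw embedding; the inline OutputHead construction is kept as in the original snippet.

```python
def forward(self, x) -> Tensor:
    # Embed tokens
    x = self.embed(x)
    if self.post_embed_norm is not False:
        x = self.norm(x)
    # Chain the layers: each layer consumes the previous layer's output
    out = x
    for layer in self.layers:
        out = layer(out)
    if self.output_head_on is not False:
        # OutputHead kept inline as in the original snippet
        return OutputHead(self.dim, 1, self.vocab_size)(out)
    return out
```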