Multi-headed sparse gating network

Large networks leverage sparsity to control compute costs.
Sparsity can be introduced statically or dynamically; in mixture-of-experts models, a gating network activates only a subset of experts for each input. DeepMoE applies this idea inside deep convolutional networks: instead of routing between whole experts, a multi-headed sparse gating network performs deep routing, increasing the representational power of the network by adaptively sparsifying and recalibrating the channels of each convolutional layer. Details are in the link below. A sketch of the idea follows.
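Below is a minimal PyTorch sketch of the mechanism as described above, not the paper's exact implementation: a shallow embedding network produces a per-example embedding, and one gating head per convolutional layer maps that embedding through a ReLU to a sparse channel-wise gate. The names (DeepMoESketch, GatedConvBlock) are illustrative; the paper additionally trains with an L1 sparsity penalty on the gates and an auxiliary embedding loss, both omitted here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GatedConvBlock(nn.Module):
    """Conv layer whose output channels are scaled by a per-example gate vector."""

    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, x, gates):
        # gates: (batch, out_ch); channels gated to zero are effectively skipped
        out = F.relu(self.conv(x))
        return out * gates.unsqueeze(-1).unsqueeze(-1)


class DeepMoESketch(nn.Module):
    def __init__(self, channels=(3, 32, 64), embed_dim=16, num_classes=10):
        super().__init__()
        # Shallow embedding network: cheap summary of the input that drives all gates.
        self.embedder = nn.Sequential(
            nn.Conv2d(channels[0], embed_dim, kernel_size=3, padding=1),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
        )
        self.blocks = nn.ModuleList(
            GatedConvBlock(c_in, c_out)
            for c_in, c_out in zip(channels[:-1], channels[1:])
        )
        # One gating head per conv layer ("multi-headed").
        self.heads = nn.ModuleList(
            nn.Linear(embed_dim, c_out) for c_out in channels[1:]
        )
        self.classifier = nn.Linear(channels[-1], num_classes)

    def forward(self, x):
        e = self.embedder(x)  # shared shallow embedding
        for block, head in zip(self.blocks, self.heads):
            # ReLU produces exact zeros, so the gate is sparse and also
            # rescales (recalibrates) the surviving channels.
            gates = F.relu(head(e))
            x = block(x, gates)
        x = x.mean(dim=(2, 3))  # global average pooling
        return self.classifier(x)


model = DeepMoESketch()
logits = model(torch.randn(2, 3, 32, 32))  # -> shape (2, 10)
```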

Links

Deep Mixture of Experts via Shallow Embedding (DeepMoE): https://arxiv.org/pdf/1806.01531.pdf