Add Moonshine (#34784)
* config draft

* full encoder forward

* full decoder forward

* fix sdpa and FA2

* fix sdpa and FA2

* moonshine model

* moonshine model forward

* fix attention with past_key_values

* add MoonshineForConditionalGeneration

* fix cache handling and causality for cross attention

* no causal attention mask for the encoder

* model addition (imports etc)

* small nit

* nits

* Update src/transformers/models/moonshine/convert_usefulsensors_to_hf.py

Co-authored-by: Joshua Lochner <admin@xenova.com>

* add rope_theta

* nits

* model doc

* Update src/transformers/models/auto/configuration_auto.py

Co-authored-by: Joshua Lochner <admin@xenova.com>

* imports

* add MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING_NAMES

* updates modular

* make

* make fix-copies

* ruff check examples fix

* fix check_modular_conversion

* nit

* nits

* nits

* copied from -> imports

* imports fix

* integrate attention refactor

* modular edge case

* remove encoder

* convolutions params in config

* run modular_model_converter

* make

* Update docs/source/en/model_doc/moonshine.md

Co-authored-by: Joshua Lochner <admin@xenova.com>

* MoonshineModelTest

* correct typo

* make style

* integration tests

* make

* modular convert

* name conversion update (up_proj -> fc1 etc)

* update config

* update MLP

* update attention

* update encoder layer

* update decoder layer

* update convolutions parameters

* update encoder

* remove INPUTS_DOCSTRING

* update decoder

* update conditional generation

* update pretrained model

* imports

* modular converted

* update doc

* fix

* typo

* update doc

* update license

* update init

* split config in file

* two classes for MLP

* attention from GLM

* from GlmRotaryEmbedding

* split MLP

* apply Arthur's review suggestions

* apply Arthur's review suggestions

* apply Arthur's review suggestions

* auto feature extractor

* convert modular

* fix + make

* convert modular

* make

* unsplit config

* use correct checkpoint

* wrap generate

* update tests

* typos

* make

* typo

* update doc

---------

Co-authored-by: Joshua Lochner <admin@xenova.com>
eustlb and xenova authored Jan 10, 2025
1 parent 6f127d3 commit 5f087d1
Showing 19 changed files with 3,852 additions and 2 deletions.
4 changes: 3 additions & 1 deletion docs/source/en/_toctree.yml
@@ -505,7 +505,9 @@
- local: model_doc/mobilebert
title: MobileBERT
- local: model_doc/modernbert
title: ModernBERT
title: ModernBert
- local: model_doc/moonshine
title: moonshine
- local: model_doc/mpnet
title: MPNet
- local: model_doc/mpt
1 change: 1 addition & 0 deletions docs/source/en/index.md
@@ -235,6 +235,7 @@ Flax), PyTorch, and/or TensorFlow.
| [MobileViT](model_doc/mobilevit) ||||
| [MobileViTV2](model_doc/mobilevitv2) ||||
| [ModernBERT](model_doc/modernbert) ||||
| [Moonshine](model_doc/moonshine) ||||
| [Moshi](model_doc/moshi) ||||
| [MPNet](model_doc/mpnet) ||||
| [MPT](model_doc/mpt) ||||
56 changes: 56 additions & 0 deletions docs/source/en/model_doc/moonshine.md
@@ -0,0 +1,56 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.
-->

# Moonshine

## Overview

The Moonshine model was proposed in [Moonshine: Speech Recognition for Live Transcription and Voice Commands](https://arxiv.org/abs/2410.15608) by Nat Jeffries, Evan King, Manjunath Kudlur, Guy Nicholson, James Wang, Pete Warden.

The abstract from the paper is the following:

*This paper introduces Moonshine, a family of speech recognition models optimized for live transcription and voice command processing. Moonshine is based on an encoder-decoder transformer architecture and employs Rotary Position Embedding (RoPE) instead of traditional absolute position embeddings. The model is trained on speech segments of various lengths, but without using zero-padding, leading to greater efficiency for the encoder during inference time. When benchmarked against OpenAI's Whisper tiny-en, Moonshine Tiny demonstrates a 5x reduction in compute requirements for transcribing a 10-second speech segment while incurring no increase in word error rates across standard evaluation datasets. These results highlight Moonshine's potential for real-time and resource-constrained applications.*

Tips:

- Moonshine improves upon Whisper's architecture:
1. It uses SwiGLU activation instead of GELU in the decoder layers (see the sketch below).
2. Most importantly, it replaces absolute position embeddings with Rotary Position Embeddings (RoPE). This allows Moonshine to handle audio inputs of any length, unlike Whisper which is restricted to fixed 30-second windows.
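
As a rough illustration of the first tip, a SwiGLU-style feed-forward block can be sketched as follows (the layer names are illustrative and not necessarily those used in the actual Moonshine implementation):

```python
import torch
import torch.nn as nn


class SwiGLUMLP(nn.Module):
    """Minimal SwiGLU feed-forward block: a SiLU-gated projection instead of a plain GELU MLP."""

    def __init__(self, hidden_size: int, intermediate_size: int):
        super().__init__()
        self.gate_proj = nn.Linear(hidden_size, intermediate_size)
        self.up_proj = nn.Linear(hidden_size, intermediate_size)
        self.down_proj = nn.Linear(intermediate_size, hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # SwiGLU: elementwise product of a SiLU-activated gate and a linear "up" projection
        return self.down_proj(nn.functional.silu(self.gate_proj(x)) * self.up_proj(x))
```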

This model was contributed by [Eustache Le Bihan (eustlb)](https://huggingface.co/eustlb).
The original code can be found [here](https://github.com/usefulsensors/moonshine).

## Resources

- [Automatic speech recognition task guide](../tasks/asr)

## MoonshineConfig

[[autodoc]] MoonshineConfig

## MoonshineModel

[[autodoc]] MoonshineModel
- forward
- _mask_input_features

## MoonshineForConditionalGeneration

[[autodoc]] MoonshineForConditionalGeneration
- forward
- generate
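
As a usage sketch for the classes documented above (the `UsefulSensors/moonshine-tiny` checkpoint name and the 16 kHz test clip are assumptions, not part of this commit):

```python
import torch
from datasets import load_dataset
from transformers import AutoProcessor, MoonshineForConditionalGeneration

checkpoint = "UsefulSensors/moonshine-tiny"  # assumed checkpoint name
processor = AutoProcessor.from_pretrained(checkpoint)
model = MoonshineForConditionalGeneration.from_pretrained(checkpoint)

# Load a short 16 kHz speech sample.
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
audio = ds[0]["audio"]["array"]

inputs = processor(audio, sampling_rate=16_000, return_tensors="pt")
with torch.no_grad():
    generated_ids = model.generate(**inputs)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0])
```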

4 changes: 3 additions & 1 deletion docs/source/en/perf_infer_gpu_one.md
@@ -68,6 +68,7 @@ FlashAttention-2 is currently supported for the following architectures:
* [Llava-NeXT](https://huggingface.co/docs/transformers/model_doc/llava_next)
* [Llava-NeXT-Video](https://huggingface.co/docs/transformers/model_doc/llava_next_video)
* [LLaVA-Onevision](https://huggingface.co/docs/transformers/model_doc/llava_onevision)
* [Moonshine](https://huggingface.co/docs/transformers/model_doc/moonshine#transformers.MoonshineModel)
* [Mimi](https://huggingface.co/docs/transformers/model_doc/mimi)
* [VipLlava](https://huggingface.co/docs/transformers/model_doc/vipllava)
* [VideoLlava](https://huggingface.co/docs/transformers/model_doc/video_llava)
@@ -265,6 +266,7 @@ For now, Transformers supports SDPA inference and training for the following architectures:
* [Llava-NeXT-Video](https://huggingface.co/docs/transformers/model_doc/llava_next_video)
* [LLaVA-Onevision](https://huggingface.co/docs/transformers/model_doc/llava_onevision)
* [M2M100](https://huggingface.co/docs/transformers/model_doc/m2m_100#transformers.M2M100Model)
* [Moonshine](https://huggingface.co/docs/transformers/model_doc/moonshine#transformers.MoonshineModel)
* [Mimi](https://huggingface.co/docs/transformers/model_doc/mimi)
* [Mistral](https://huggingface.co/docs/transformers/model_doc/mistral#transformers.MistralModel)
* [Mllama](https://huggingface.co/docs/transformers/model_doc/mllama#transformers.MllamaForConditionalGeneration)
@@ -283,8 +285,8 @@ For now, Transformers supports SDPA inference and training for the following architectures:
* [Phi3](https://huggingface.co/docs/transformers/model_doc/phi3#transformers.Phi3Model)
* [PhiMoE](https://huggingface.co/docs/transformers/model_doc/phimoe#transformers.PhimoeModel)
* [Idefics](https://huggingface.co/docs/transformers/model_doc/idefics#transformers.IdeficsModel)
* [Whisper](https://huggingface.co/docs/transformers/model_doc/whisper#transformers.WhisperModel)
* [mBart](https://huggingface.co/docs/transformers/model_doc/mbart#transformers.MBartModel)
* [Moonshine](https://huggingface.co/docs/transformers/model_doc/moonshine#transformers.MoonshineModel)
* [Mistral](https://huggingface.co/docs/transformers/model_doc/mistral#transformers.MistralModel)
* [Mixtral](https://huggingface.co/docs/transformers/model_doc/mixtral#transformers.MixtralModel)
* [StableLm](https://huggingface.co/docs/transformers/model_doc/stablelm#transformers.StableLmModel)
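Since Moonshine is now listed under both backends, the attention implementation can be selected at load time in the usual way. A sketch, assuming the `UsefulSensors/moonshine-tiny` checkpoint, a CUDA GPU, and an installed `flash-attn` package for the FlashAttention-2 path:

```python
import torch
from transformers import MoonshineForConditionalGeneration

model = MoonshineForConditionalGeneration.from_pretrained(
    "UsefulSensors/moonshine-tiny",           # assumed checkpoint name
    torch_dtype=torch.float16,                # FlashAttention-2 requires fp16/bf16
    attn_implementation="flash_attention_2",  # or "sdpa" (default on recent PyTorch), or "eager"
).to("cuda")
```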
14 changes: 14 additions & 0 deletions src/transformers/__init__.py
@@ -610,6 +610,7 @@
"models.mobilevit": ["MobileViTConfig"],
"models.mobilevitv2": ["MobileViTV2Config"],
"models.modernbert": ["ModernBertConfig"],
"models.moonshine": ["MoonshineConfig"],
"models.moshi": [
"MoshiConfig",
"MoshiDepthConfig",
@@ -2907,6 +2908,13 @@
"ModernBertPreTrainedModel",
]
)
_import_structure["models.moonshine"].extend(
[
"MoonshineForConditionalGeneration",
"MoonshineModel",
"MoonshinePreTrainedModel",
]
)
_import_structure["models.moshi"].extend(
[
"MoshiForCausalLM",
@@ -5633,6 +5641,7 @@
MobileViTV2Config,
)
from .models.modernbert import ModernBertConfig
from .models.moonshine import MoonshineConfig
from .models.moshi import (
MoshiConfig,
MoshiDepthConfig,
@@ -7652,6 +7661,11 @@
ModernBertModel,
ModernBertPreTrainedModel,
)
from .models.moonshine import (
MoonshineForConditionalGeneration,
MoonshineModel,
MoonshinePreTrainedModel,
)
from .models.moshi import (
MoshiForCausalLM,
MoshiForConditionalGeneration,
1 change: 1 addition & 0 deletions src/transformers/models/__init__.py
@@ -170,6 +170,7 @@
mobilevit,
mobilevitv2,
modernbert,
moonshine,
moshi,
mpnet,
mpt,
2 changes: 2 additions & 0 deletions src/transformers/models/auto/configuration_auto.py
@@ -190,6 +190,7 @@
("mobilevit", "MobileViTConfig"),
("mobilevitv2", "MobileViTV2Config"),
("modernbert", "ModernBertConfig"),
("moonshine", "MoonshineConfig"),
("moshi", "MoshiConfig"),
("mpnet", "MPNetConfig"),
("mpt", "MptConfig"),
@@ -519,6 +520,7 @@
("mobilevit", "MobileViT"),
("mobilevitv2", "MobileViTV2"),
("modernbert", "ModernBERT"),
("moonshine", "Moonshine"),
("moshi", "Moshi"),
("mpnet", "MPNet"),
("mpt", "MPT"),
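Registering the model type in `configuration_auto.py` lets the auto machinery build a `MoonshineConfig` from the `"moonshine"` identifier, as in this minimal sketch:

```python
from transformers import AutoConfig

config = AutoConfig.for_model("moonshine")  # builds a default MoonshineConfig from the registered model type
print(config.model_type)                    # "moonshine"
```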
1 change: 1 addition & 0 deletions src/transformers/models/auto/feature_extraction_auto.py
@@ -73,6 +73,7 @@
("mobilenet_v1", "MobileNetV1FeatureExtractor"),
("mobilenet_v2", "MobileNetV2FeatureExtractor"),
("mobilevit", "MobileViTFeatureExtractor"),
("moonshine", "Wav2Vec2FeatureExtractor"),
("moshi", "EncodecFeatureExtractor"),
("nat", "ViTFeatureExtractor"),
("owlvit", "OwlViTFeatureExtractor"),
3 changes: 3 additions & 0 deletions src/transformers/models/auto/modeling_auto.py
@@ -179,6 +179,7 @@
("mobilevit", "MobileViTModel"),
("mobilevitv2", "MobileViTV2Model"),
("modernbert", "ModernBertModel"),
("moonshine", "MoonshineModel"),
("moshi", "MoshiModel"),
("mpnet", "MPNetModel"),
("mpt", "MptModel"),
@@ -436,6 +437,7 @@
("mega", "MegaForMaskedLM"),
("megatron-bert", "MegatronBertForCausalLM"),
("mobilebert", "MobileBertForMaskedLM"),
("moonshine", "MoonshineForConditionalGeneration"),
("mpnet", "MPNetForMaskedLM"),
("mpt", "MptForCausalLM"),
("mra", "MraForMaskedLM"),
@@ -937,6 +939,7 @@

MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING_NAMES = OrderedDict(
[
("moonshine", "MoonshineForConditionalGeneration"),
("pop2piano", "Pop2PianoForConditionalGeneration"),
("seamless_m4t", "SeamlessM4TForSpeechToText"),
("seamless_m4t_v2", "SeamlessM4Tv2ForSpeechToText"),
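With the entry in `MODEL_FOR_SPEECH_SEQ_2_SEQ_MAPPING_NAMES`, the generic auto class and the ASR pipeline should route to Moonshine. A sketch, assuming the `UsefulSensors/moonshine-tiny` checkpoint and a local `sample.wav` file:

```python
from transformers import AutoModelForSpeechSeq2Seq, pipeline

# Resolves to MoonshineForConditionalGeneration through the mapping registered above.
model = AutoModelForSpeechSeq2Seq.from_pretrained("UsefulSensors/moonshine-tiny")

# The automatic-speech-recognition pipeline relies on the same mapping.
asr = pipeline("automatic-speech-recognition", model="UsefulSensors/moonshine-tiny")
print(asr("sample.wav")["text"])  # "sample.wav" is a placeholder audio file
```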
1 change: 1 addition & 0 deletions src/transformers/models/auto/processing_auto.py
@@ -81,6 +81,7 @@
("mctct", "MCTCTProcessor"),
("mgp-str", "MgpstrProcessor"),
("mllama", "MllamaProcessor"),
("moonshine", "Wav2Vec2Processor"),
("oneformer", "OneFormerProcessor"),
("owlv2", "Owlv2Processor"),
("owlvit", "OwlViTProcessor"),
1 change: 1 addition & 0 deletions src/transformers/models/auto/tokenization_auto.py
@@ -321,6 +321,7 @@
("mluke", ("MLukeTokenizer" if is_sentencepiece_available() else None, None)),
("mobilebert", ("MobileBertTokenizer", "MobileBertTokenizerFast" if is_tokenizers_available() else None)),
("modernbert", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
("moonshine", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
("moshi", (None, "PreTrainedTokenizerFast" if is_tokenizers_available() else None)),
("mpnet", ("MPNetTokenizer", "MPNetTokenizerFast" if is_tokenizers_available() else None)),
("mpt", (None, "GPTNeoXTokenizerFast" if is_tokenizers_available() else None)),
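The preprocessing mappings added in the three auto files above (feature extractor, processor, tokenizer) reuse existing components rather than introducing Moonshine-specific ones; loading through the auto classes should resolve as follows (a sketch, assuming the `UsefulSensors/moonshine-tiny` checkpoint):

```python
from transformers import AutoFeatureExtractor, AutoProcessor, AutoTokenizer

checkpoint = "UsefulSensors/moonshine-tiny"  # assumed checkpoint name
feature_extractor = AutoFeatureExtractor.from_pretrained(checkpoint)  # -> Wav2Vec2FeatureExtractor
tokenizer = AutoTokenizer.from_pretrained(checkpoint)                 # -> PreTrainedTokenizerFast
processor = AutoProcessor.from_pretrained(checkpoint)                 # -> Wav2Vec2Processor
print(type(feature_extractor).__name__, type(tokenizer).__name__, type(processor).__name__)
```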
27 changes: 27 additions & 0 deletions src/transformers/models/moonshine/__init__.py
@@ -0,0 +1,27 @@
# Copyright 2025 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import TYPE_CHECKING

from ...utils import _LazyModule
from ...utils.import_utils import define_import_structure


if TYPE_CHECKING:
from .configuration_moonshine import *
from .modeling_moonshine import *
else:
import sys

_file = globals()["__file__"]
sys.modules[__name__] = _LazyModule(__name__, _file, define_import_structure(_file), module_spec=__spec__)
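
The lazy-module pattern above defers the real imports until first attribute access; once resolved, the exported classes can be used directly. A minimal sketch that builds a randomly initialized model from the default configuration:

```python
from transformers import MoonshineConfig, MoonshineForConditionalGeneration

config = MoonshineConfig()                          # default hyperparameters
model = MoonshineForConditionalGeneration(config)   # randomly initialized weights
print(f"{sum(p.numel() for p in model.parameters()):,} parameters")
```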
