2024-11-01-distilbart_xsum_12_6_en (#14447)
* Add model 2024-11-01-distilbart_xsum_12_6_en

* Add model 2024-11-03-gpt2_en

* Add model 2024-11-08-hubert_ukrainian_uk

* Add model 2024-11-08-hubert_ukrainian_pipeline_uk

* Add model 2024-11-08-unitku_hubert_japanese_asr_ja

* Add model 2024-11-08-unitku_hubert_japanese_asr_pipeline_ja

* Add model 2024-11-08-hubert_large_japanese_asr_ja

* Add model 2024-11-08-hubert_large_japanese_asr_pipeline_ja

---------

Co-authored-by: ahmedlone127 <ahmedlone127@gmail.com>
jsl-models and ahmedlone127 authored Nov 9, 2024
1 parent d8d4736 commit 5a556ba
Showing 8 changed files with 626 additions and 0 deletions.
74 changes: 74 additions & 0 deletions docs/_posts/ahmedlone127/2024-11-01-distilbart_xsum_12_6_en.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,74 @@
---
layout: model
title: Abstractive Summarization by BART - DistilBART XSUM
author: John Snow Labs
name: distilbart_xsum_12_6
date: 2024-11-01
tags: [en, summarization, text_to_text, distil, open_source, openvino]
task: Summarization
language: en
edition: Spark NLP 5.5.0
spark_version: 3.0
supported: true
engine: openvino
annotator: BartTransformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

BART (Bidirectional and Auto-Regressive Transformers) is a sequence-to-sequence language model introduced by Facebook AI in 2019 in the paper “BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension”. It is based on the transformer architecture and is designed to handle a wide range of natural language processing tasks such as text generation, summarization, and machine translation.

This pre-trained model is DistilBART fine-tuned on the Extreme Summarization (XSum) Dataset.

## Predicted Entities



{:.btn-box}
<button class="button button-orange" disabled>Live Demo</button>
<button class="button button-orange" disabled>Open in Colab</button>
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/distilbart_xsum_12_6_en_5.5.0_3.0_1730492024334.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/distilbart_xsum_12_6_en_5.5.0_3.0_1730492024334.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use



<div class="tabs-box" markdown="1">
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
bart = BartTransformer.pretrained("distilbart_xsum_12_6") \
.setTask("summarize:") \
.setMaxOutputLength(200) \
.setInputCols(["documents"]) \
.setOutputCol("summaries")
```
```scala
val bart = BartTransformer.pretrained("distilbart_xsum_12_6")
.setTask("summarize:")
.setMaxOutputLength(200)
.setInputCols("documents")
.setOutputCol("summaries")
```
</div>
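
The snippets above only configure the annotator. As a minimal sketch of an end-to-end run, the function below wraps the same settings in a full pipeline. The function name `summarize`, the `text` input column, and the use of `sparknlp.start()` to obtain a session are assumptions made for illustration and are not part of the original card. The imports sit inside the function so that Spark only starts when it is called.

```python
def summarize(texts):
    """Summarize a list of strings; returns one list of summary strings per input."""
    # Imported here so that defining the function does not require a running Spark session.
    import sparknlp
    from pyspark.ml import Pipeline
    from sparknlp.annotator import BartTransformer
    from sparknlp.base import DocumentAssembler

    spark = sparknlp.start()

    # Turn the raw "text" column (an assumed column name) into document annotations.
    document_assembler = (
        DocumentAssembler()
        .setInputCol("text")
        .setOutputCol("documents")
    )

    # Same annotator settings as in the snippet above.
    bart = (
        BartTransformer.pretrained("distilbart_xsum_12_6")
        .setTask("summarize:")
        .setMaxOutputLength(200)
        .setInputCols(["documents"])
        .setOutputCol("summaries")
    )

    pipeline = Pipeline().setStages([document_assembler, bart])
    data = spark.createDataFrame([[t] for t in texts]).toDF("text")
    result = pipeline.fit(data).transform(data)

    # "summaries.result" extracts the generated text from each annotation.
    return [row[0] for row in result.select("summaries.result").collect()]
```

Calling `summarize(["<a long news article>"])` downloads the model on first use, so it needs network access and enough memory for the 853.7 MB model.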

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|distilbart_xsum_12_6|
|Compatibility:|Spark NLP 5.5.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[generation]|
|Language:|en|
|Size:|853.7 MB|

## References

https://huggingface.co/sshleifer/distilbart-xsum-12-6
93 changes: 93 additions & 0 deletions docs/_posts/ahmedlone127/2024-11-03-gpt2_en.md
@@ -0,0 +1,93 @@
---
layout: model
title: GPT2 text-to-text model (Base)
author: John Snow Labs
name: gpt2
date: 2024-11-03
tags: [gpt2, en, open_source, onnx, openvino]
task: Text Generation
language: en
edition: Spark NLP 5.5.0
spark_version: 3.0
supported: true
engine: openvino
annotator: GPT2Transformer
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

“GPT-2 displays a broad set of capabilities, including the ability to generate conditional synthetic text samples of unprecedented quality, where the model is primed with an input and it generates a lengthy continuation.”

## Predicted Entities



{:.btn-box}
<button class="button button-orange" disabled>Live Demo</button>
<button class="button button-orange" disabled>Open in Colab</button>
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/gpt2_en_5.5.0_3.0_1730653115205.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/gpt2_en_5.5.0_3.0_1730653115205.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use



<div class="tabs-box" markdown="1">
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
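import sparknlp
from pyspark.ml import Pipeline
from sparknlp.annotator import GPT2Transformer
from sparknlp.base import DocumentAssembler

# Start (or reuse) a Spark session with Spark NLP loaded
spark = sparknlp.start()

documentAssembler = DocumentAssembler() \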
.setInputCol("text") \
.setOutputCol("documents")

gpt2 = GPT2Transformer.pretrained("gpt2") \
.setInputCols(["documents"]) \
.setMaxOutputLength(50) \
.setOutputCol("generation")

pipeline = Pipeline().setStages([documentAssembler, gpt2])
data = spark.createDataFrame([["My name is Leonardo."]]).toDF("text")
result = pipeline.fit(data).transform(data)
# The annotator writes to the "generation" column; ".result" holds the generated text
result.select("generation.result").show(truncate=False)
```
```scala
val documentAssembler = new DocumentAssembler()
.setInputCol("text")
.setOutputCol("documents")

val gpt2 = GPT2Transformer.pretrained("gpt2")
.setInputCols(Array("documents"))
.setMinOutputLength(10)
.setMaxOutputLength(50)
.setDoSample(false)
.setTopK(50)
.setNoRepeatNgramSize(3)
.setOutputCol("generation")

val pipeline = new Pipeline().setStages(Array(documentAssembler, gpt2))

val data = Seq("My name is Leonardo.").toDF("text")
val result = pipeline.fit(data).transform(data)
result.select("generation.result").show(truncate = false)
```
</div>

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|gpt2|
|Compatibility:|Spark NLP 5.5.0+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[documents]|
|Output Labels:|[generation]|
|Language:|en|
|Size:|467.4 MB|

## References

https://huggingface.co/openai-community/gpt2
@@ -0,0 +1,84 @@
---
layout: model
title: Japanese hubert_large_japanese_asr HubertForCTC from TKU410410103
author: John Snow Labs
name: hubert_large_japanese_asr
date: 2024-11-08
tags: [ja, open_source, onnx, asr, hubert]
task: Automatic Speech Recognition
language: ja
edition: Spark NLP 5.5.1
spark_version: 3.0
supported: true
engine: onnx
annotator: HubertForCTC
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained HubertForCTC model, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `hubert_large_japanese_asr` is a Japanese model originally trained by TKU410410103.

{:.btn-box}
<button class="button button-orange" disabled>Live Demo</button>
<button class="button button-orange" disabled>Open in Colab</button>
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/hubert_large_japanese_asr_ja_5.5.1_3.0_1731106819898.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/hubert_large_japanese_asr_ja_5.5.1_3.0_1731106819898.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use



<div class="tabs-box" markdown="1">
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python
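from pyspark.ml import Pipeline
from sparknlp.annotator import HubertForCTC
from sparknlp.base import AudioAssembler

# `data` is a DataFrame whose "audio_content" column holds the audio samples as an array of floats
audioAssembler = AudioAssembler() \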
.setInputCol("audio_content") \
.setOutputCol("audio_assembler")

speechToText = HubertForCTC.pretrained("hubert_large_japanese_asr","ja") \
.setInputCols(["audio_assembler"]) \
.setOutputCol("text")

pipeline = Pipeline().setStages([audioAssembler, speechToText])
pipelineModel = pipeline.fit(data)
pipelineDF = pipelineModel.transform(data)

```
```scala

val audioAssembler = new AudioAssembler()
.setInputCol("audio_content")
.setOutputCol("audio_assembler")

val speechToText = HubertForCTC.pretrained("hubert_large_japanese_asr", "ja")
.setInputCols(Array("audio_assembler"))
.setOutputCol("text")

val pipeline = new Pipeline().setStages(Array(audioAssembler, speechToText))
val pipelineModel = pipeline.fit(data)
val pipelineDF = pipelineModel.transform(data)

```
</div>
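
The pipeline above reads audio from an `audio_content` column, which `AudioAssembler` expects to contain the samples as an array of floats. As a minimal sketch of how such an array could be produced with the Python standard library alone, the helper below reads a WAV file. The function name `wav_to_floats` is hypothetical, and the sketch assumes 16-bit mono PCM input. HuBERT checkpoints are generally trained on 16 kHz audio, so the returned sample rate is worth checking before transcription.

```python
import array
import sys
import wave


def wav_to_floats(path):
    """Read a 16-bit mono PCM WAV file.

    Returns (sample_rate, samples), with samples scaled to the range [-1.0, 1.0).
    """
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2 or wav.getnchannels() != 1:
            raise ValueError("expected 16-bit mono PCM audio")
        sample_rate = wav.getframerate()
        frames = wav.readframes(wav.getnframes())

    pcm = array.array("h")  # signed 16-bit integers
    pcm.frombytes(frames)
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV sample data is little-endian

    # Scale the integer samples to floats.
    return sample_rate, [sample / 32768.0 for sample in pcm]
```

Assuming an active Spark session named `spark`, the samples could then be wrapped in a one-row DataFrame with `spark.createDataFrame([[samples]]).toDF("audio_content")` and passed to the pipeline as `data`.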

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|hubert_large_japanese_asr|
|Compatibility:|Spark NLP 5.5.1+|
|License:|Open Source|
|Edition:|Official|
|Input Labels:|[audio_assembler]|
|Output Labels:|[text]|
|Language:|ja|
|Size:|2.4 GB|

## References

https://huggingface.co/TKU410410103/hubert-large-japanese-asr
@@ -0,0 +1,69 @@
---
layout: model
title: Japanese hubert_large_japanese_asr_pipeline pipeline HubertForCTC from TKU410410103
author: John Snow Labs
name: hubert_large_japanese_asr_pipeline
date: 2024-11-08
tags: [ja, open_source, pipeline, onnx]
task: Automatic Speech Recognition
language: ja
edition: Spark NLP 5.5.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained HubertForCTC, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `hubert_large_japanese_asr_pipeline` is a Japanese model originally trained by TKU410410103.

{:.btn-box}
<button class="button button-orange" disabled>Live Demo</button>
<button class="button button-orange" disabled>Open in Colab</button>
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/hubert_large_japanese_asr_pipeline_ja_5.5.1_3.0_1731106937966.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/hubert_large_japanese_asr_pipeline_ja_5.5.1_3.0_1731106937966.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use



<div class="tabs-box" markdown="1">
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python

pipeline = PretrainedPipeline("hubert_large_japanese_asr_pipeline", lang = "ja")
annotations = pipeline.transform(df)

```
```scala

val pipeline = new PretrainedPipeline("hubert_large_japanese_asr_pipeline", lang = "ja")
val annotations = pipeline.transform(df)

```
</div>
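
The snippets above leave `df` undefined. The sketch below shows one way it might be built from a list of audio samples. The function name `transcribe` is hypothetical. The `audio_content` column name is an assumption carried over from the standalone `hubert_large_japanese_asr` example and is not confirmed for this pipeline. The caller is assumed to provide a Spark session with Spark NLP loaded.

```python
def transcribe(spark, samples, input_col="audio_content"):
    """Run the pretrained ASR pipeline on one audio clip given as a list of floats.

    `input_col` is an assumed column name; adjust it if the pipeline's
    AudioAssembler was saved with a different input column.
    """
    # Imported here so that defining the function does not require Spark NLP to be loaded.
    from sparknlp.pretrained import PretrainedPipeline

    # One row, one column: the whole clip as an array of floats.
    df = spark.createDataFrame([[samples]]).toDF(input_col)

    pipeline = PretrainedPipeline("hubert_large_japanese_asr_pipeline", lang="ja")
    return pipeline.transform(df)
```

The function returns the transformed DataFrame unchanged. Inspecting it with `printSchema()` shows which columns the pipeline actually produced.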

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|hubert_large_japanese_asr_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 5.5.1+|
|License:|Open Source|
|Edition:|Official|
|Language:|ja|
|Size:|2.4 GB|

## References

https://huggingface.co/TKU410410103/hubert-large-japanese-asr

## Included Models

- AudioAssembler
- HubertForCTC
@@ -0,0 +1,69 @@
---
layout: model
title: Ukrainian hubert_ukrainian_pipeline pipeline HubertForCTC from Yehor
author: John Snow Labs
name: hubert_ukrainian_pipeline
date: 2024-11-08
tags: [uk, open_source, pipeline, onnx]
task: Automatic Speech Recognition
language: uk
edition: Spark NLP 5.5.1
spark_version: 3.0
supported: true
annotator: PipelineModel
article_header:
type: cover
use_language_switcher: "Python-Scala-Java"
---

## Description

Pretrained HubertForCTC, adapted from Hugging Face and curated to provide scalability and production-readiness using Spark NLP. `hubert_ukrainian_pipeline` is a Ukrainian model originally trained by Yehor.

{:.btn-box}
<button class="button button-orange" disabled>Live Demo</button>
<button class="button button-orange" disabled>Open in Colab</button>
[Download](https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/hubert_ukrainian_pipeline_uk_5.5.1_3.0_1731106461400.zip){:.button.button-orange.button-orange-trans.arr.button-icon}
[Copy S3 URI](s3://auxdata.johnsnowlabs.com/public/models/hubert_ukrainian_pipeline_uk_5.5.1_3.0_1731106461400.zip){:.button.button-orange.button-orange-trans.button-icon.button-copy-s3}

## How to use



<div class="tabs-box" markdown="1">
{% include programmingLanguageSelectScalaPythonNLU.html %}
```python

pipeline = PretrainedPipeline("hubert_ukrainian_pipeline", lang = "uk")
annotations = pipeline.transform(df)

```
```scala

val pipeline = new PretrainedPipeline("hubert_ukrainian_pipeline", lang = "uk")
val annotations = pipeline.transform(df)

```
</div>

{:.model-param}
## Model Information

{:.table-model}
|---|---|
|Model Name:|hubert_ukrainian_pipeline|
|Type:|pipeline|
|Compatibility:|Spark NLP 5.5.1+|
|License:|Open Source|
|Edition:|Official|
|Language:|uk|
|Size:|708.6 MB|

## References

https://huggingface.co/Yehor/hubert-uk

## Included Models

- AudioAssembler
- HubertForCTC