.setReadMonthFirst always reads month first #14100

Closed
1 task done
KyriakosAseto opened this issue Dec 18, 2023 · 2 comments · Fixed by #14381
@KyriakosAseto

Is there an existing issue for this?

  • I have searched the existing issues and did not find a match.

Who can help?

No response

What are you working on?

I am using the example provided by Spark NLP with customized settings, and I am trying to configure the matchers so that they do not read the month first:

date = DateMatcher() \
    .setInputCols("document") \
    .setOutputCol("date") \
    .setReadMonthFirst(False) \
    .setOutputFormat("dd/MM/yyyy")

multiDate = MultiDateMatcher() \
    .setInputCols("document") \
    .setReadMonthFirst(False) \
    .setOutputCol("multi_date") \
    .setOutputFormat("dd/MM/yyyy") 

Current Behavior

Setting the parameter to False has no effect: the month is always read first from the input.
Please see the example "I was born at 01/03/98", which is intended to be the 1st of March 1998.

Expected Behavior

My example should be read day first, giving 01/03/1998 (1 March 1998), rather than month first.
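To make the two readings concrete, here is a minimal sketch that uses only Python's standard library, not Spark NLP. The helper `parse_numeric_date` and its `month_first` flag are hypothetical names chosen for illustration; they only mimic what `setReadMonthFirst` is expected to control and say nothing about how the annotators are implemented.

```python
from datetime import datetime


def parse_numeric_date(text: str, month_first: bool) -> datetime:
    """Hypothetical helper (not part of Spark NLP): parse an xx/xx/yy date string."""
    # Month-first reads the first field as the month; day-first reads it as the day.
    fmt = "%m/%d/%y" if month_first else "%d/%m/%y"
    return datetime.strptime(text, fmt)


month_first = parse_numeric_date("01/03/98", month_first=True)
day_first = parse_numeric_date("01/03/98", month_first=False)

# Render both readings in the dd/MM/yyyy layout used in the pipeline above.
print(month_first.strftime("%d/%m/%Y"))  # 03/01/1998 (3 January 1998)
print(day_first.strftime("%d/%m/%Y"))    # 01/03/1998 (1 March 1998)
```

With `setReadMonthFirst(False)` and the output format `dd/MM/yyyy`, the second reading is the one I expect.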

Steps To Reproduce

import sparknlp
from sparknlp.annotator import DocumentAssembler, DateMatcher, MultiDateMatcher
from pyspark.sql.types import StringType
from pyspark.ml import Pipeline

spark = sparknlp.start()
spark


documentAssembler = DocumentAssembler() \
    .setInputCol("text") \
    .setOutputCol("document")

date = DateMatcher() \
    .setInputCols("document") \
    .setOutputCol("date") \
    .setReadMonthFirst(False) \
    .setOutputFormat("dd/MM/yyyy")

multiDate = MultiDateMatcher() \
    .setInputCols("document") \
    .setReadMonthFirst(False) \
    .setOutputCol("multi_date") \
    .setOutputFormat("dd/MM/yyyy") 


pipeline = Pipeline().setStages([
    documentAssembler,
    date,
    multiDate
    ])

text_list = ["See you on next monday.", 
             "I was born at 01/03/98", 
             "She was born on 02/03/1966.", 
             "The project started yesterday and will finish next year.", 
             "She will graduate by July 2023.", 
             "She will visit doctor tomorrow and next month again."]

spark_df = spark.createDataFrame(text_list, StringType()).toDF("text")

result = pipeline.fit(spark_df).transform(spark_df)
result.selectExpr("text","date.result as date", "multi_date.result as multi_date").show(truncate=False)

Spark NLP version and Apache Spark

spark-nlp==5.2.0

Type of Spark Application

Python Application

Java Version

openjdk version "11.0.21" 2023-10-17

Java Home Directory

/usr/lib/jvm/java-11-openjdk-amd64

Setup and installation

pip install numpy py4j pyspark spark-nlp

Operating System and Version

Ubuntu-22.04

Link to your project (if available)

No response

Additional Information

No response

This issue is stale because it has been open 180 days with no activity. Remove stale label or comment or this will be closed in 5 days

@github-actions bot added the Stale label on Jun 17, 2024
@github-actions bot closed this as not planned on Jul 1, 2024
@Aleksis99

This issue is still present.
