Can "item" of ArrayType be renamed via an option when writing an XML file? #602

Closed
giuseppeceravolo opened this issue Aug 31, 2022 · 12 comments


@giuseppeceravolo

giuseppeceravolo commented Aug 31, 2022

I am writing the XML file below and would like to know how I can rename "item" to "record" (since those items are inside the "records" tag). Perhaps there is a way to change the value "item" in here.

<?xml version="1.0" encoding="UTF-8"?>
<inventory xmlns="http://www.domain.com/xml/">
    <inventory-list>
        <header list-id="myShop">
            <default>false</default>
        </header>
        <records>
            <item product-id="xxxxxx-yyy1">
                <qty>0</qty>
            </item>
            <item product-id="xxxxxx-yyy2">
                <qty>0</qty>
            </item>
            <item product-id="xxxxxx-yyy3">
                <qty>0</qty>
            </item>
            <item product-id="xxxxxx-yyy4">
                <qty>0</qty>
            </item>
            <item product-id="xxxxxx-yyy5">
                <qty>0</qty>
            </item>
        </records>
    </inventory-list>
</inventory>

Here is the schema of my dataframe:

root
 |-- header: struct (nullable = true)
 |    |-- _list-id: string (nullable = true)
 |    |-- default: boolean (nullable = true)
 |-- records: array (nullable = false)
 |    |-- element: array (containsNull = false)
 |    |    |-- element: struct (containsNull = false)
 |    |    |    |-- _product-id: string (nullable = true)
 |    |    |    |-- qty: integer (nullable = true)

@srowen
Collaborator

srowen commented Aug 31, 2022

There's not a way to do it right now, but I think that's a relatively simple feature request, provided the idea is to have one new name for all array items rather than a name per type or something. I could probably add that now.

@giuseppeceravolo
Author

As of now I do not have any other array column in the output XML file, so a single name would be enough, thank you!
I am working on Databricks, where my cluster runs Databricks Runtime 10.4 LTS (includes Apache Spark 3.2.1, Scala 2.12) with "com.databricks:spark-xml_2.12:0.14.0" installed. Please let me know how I can get your changes there.

By the way, could this be related to the fact that "records" has two nested "element" levels in the schema? If there were a way to rename "element" to "record", could I fix it myself?

@giuseppeceravolo
Author

For the sake of completeness, I am adding the code I use to write the XML file.

df \
  .coalesce(1) \
  .write \
  .format('com.databricks.spark.xml') \
  .option('declaration', 'version="1.0" encoding="UTF-8"') \
  .option('rootTag', 'inventory xmlns="http://www.demandware.com/xml/impex/inventory/2007-05-31"') \
  .option('rowTag', 'inventory-list') \
  .mode('overwrite') \
  .save('/mnt/container/folder/file.xml')

@srowen
Collaborator

srowen commented Aug 31, 2022

To get the changes, I'd have to make them and release a new version. That might take some time. After that, you just install the new version as usual.

No, that's not related. "element" isn't really part of the schema w.r.t. how you access it in Spark; it is only how the array's element type is displayed when the schema is printed. You can't rename it, and it doesn't matter.

@srowen
Collaborator

srowen commented Aug 31, 2022

#603

@srowen srowen closed this as completed Aug 31, 2022
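
For readers landing here: later in this thread the option is used as 'arrayElementName' (available as of 0.16.0). The following is only a minimal sketch of how the original write call could pass it; the helper names xml_write_options and write_inventory are hypothetical and not part of spark-xml, and df is assumed to be an existing Spark DataFrame with the schema shown above.

```python
# Hypothetical helpers for illustration; only the option names come from this thread.
def xml_write_options(array_element_name="record"):
    """Return the writer options used in this thread, plus arrayElementName."""
    return {
        "declaration": 'version="1.0" encoding="UTF-8"',
        "rootTag": 'inventory xmlns="http://www.demandware.com/xml/impex/inventory/2007-05-31"',
        "rowTag": "inventory-list",
        # Replaces the default "item" tag for every array element in the output.
        "arrayElementName": array_element_name,
    }


def write_inventory(df, path, array_element_name="record"):
    """Write an existing Spark DataFrame as XML to the given output path."""
    (
        df.coalesce(1)
        .write.format("com.databricks.spark.xml")
        .options(**xml_write_options(array_element_name))
        .mode("overwrite")
        .save(path)
    )
```

Calling write_inventory(df, '/mnt/container/folder/file.xml') would then be expected to emit "record" instead of "item" inside "records". Note that a single name applies to all arrays in the output.
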
@giuseppeceravolo
Author

Sorry to bother you, but just so I know, when is version 0.16 going to be released? I need this feature for one of my current projects. Thank you so much! 😁

@srowen
Collaborator

srowen commented Sep 5, 2022

I hadn't planned to make a new release for a while. Can you just build the library from source and use it right now?

@giuseppeceravolo
Author

I see. Is it possible to do so on Databricks? If so, would you be so kind as to point out the best way to do it? Thank you for your support.

@srowen
Collaborator

srowen commented Sep 5, 2022

Sure, in Databricks you can just attach a JAR file to a cluster. You just need to build a JAR file -- one including all dependencies -- from the project. Check out the code and run "sbt assembly", and you should find the JAR in target/scala-2.12/spark-xml-assembly-0.16.0.jar. Once it's released, you'd also be able to add it by Maven coordinates rather than building it.

@giuseppeceravolo
Author

giuseppeceravolo commented Jun 2, 2023

Hi 😃 it's me again! Instead of having one name for all array items, I now need a way to specify the element name separately for each array. Do you think such an enhancement would be possible? Thank you in advance!

Something like the following:

<?xml version="1.0" encoding="UTF-8"?>
<Inventory>
    <Item>
        ...
        <ItemAttribute Name="ATTRIBUTE1">
            <AttributeCodeValue>
                <AttributeCode>1</AttributeCode>
                <AttributeValue>Value1</AttributeValue>
            </AttributeCodeValue>
        </ItemAttribute>
        <ItemAttribute Name="ATTRIBUTE2">
            <AttributeCodeValue>
                <AttributeCode>2</AttributeCode>
                <AttributeValue>Value2</AttributeValue>
            </AttributeCodeValue>
        </ItemAttribute>
        <ItemAttribute Name="ATTRIBUTE3">
            <AttributeCodeValue>
                <AttributeCode>3</AttributeCode>
                <AttributeValue>Value3</AttributeValue>
            </AttributeCodeValue>
        </ItemAttribute>
        ....
        <ItemFranchisees>
            <ItemFranchisee action="ADD" franchiseeId="F1" franchiseeName="F1"/>
            <ItemFranchisee action="ADD" franchiseeId="F2" franchiseeName="F2"/>
            <ItemFranchisee action="ADD" franchiseeId="F3" franchiseeName="F3"/>
        </ItemFranchisees>
    </Item>
</Inventory>

With the code below, as of 0.16.0, the elements inside "ItemFranchisees" are named "AttributeCodeValue", but I would like them to be named "ItemFranchisee" (see the example above).

df \
  .coalesce(1) \
  .write \
  .format('com.databricks.spark.xml') \
  .option('declaration', 'version="1.0" encoding="UTF-8"') \
  .option('rootTag', 'Inventory') \
  .option('rowTag', 'Item') \
  .option('arrayElementName', 'AttributeCodeValue') \
  .mode('overwrite') \
  .save('/mnt/container/folder/file.xml')

@srowen
Collaborator

srowen commented Jun 2, 2023

I don't think that's possible to support here easily. You can further transform the XML file with a library.
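
A minimal sketch of such a post-processing step, using Python's standard xml.etree.ElementTree. The function rename_children and the sample string are hypothetical and only mirror the example above. The sketch works on an XML string without namespaces; a namespaced document would need qualified tag names.

```python
import xml.etree.ElementTree as ET


def rename_children(xml_text, parent_tag, old_tag, new_tag):
    """Rename direct children named old_tag under every parent_tag element."""
    root = ET.fromstring(xml_text)
    for parent in root.iter(parent_tag):
        for child in parent:
            if child.tag == old_tag:
                child.tag = new_tag
    return ET.tostring(root, encoding="unicode")


# Hypothetical sample shaped like the output described above, where every
# array element was written with the single name "AttributeCodeValue".
sample = (
    "<Inventory><Item>"
    '<ItemAttribute Name="ATTRIBUTE1">'
    "<AttributeCodeValue><AttributeCode>1</AttributeCode></AttributeCodeValue>"
    "</ItemAttribute>"
    "<ItemFranchisees>"
    '<AttributeCodeValue action="ADD" franchiseeId="F1" franchiseeName="F1" />'
    "</ItemFranchisees>"
    "</Item></Inventory>"
)

# Only the children of ItemFranchisees are renamed; ItemAttribute is untouched.
fixed = rename_children(sample, "ItemFranchisees", "AttributeCodeValue", "ItemFranchisee")
print(fixed)
```

Because the rename is scoped to one parent tag, the same function can be called once per array that needs its own element name.
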

@giuseppeceravolo
Author

I see. Thank you anyway for your prompt reply
