
[FEA] GetJsonObject: Implement JSON generator to print JSON items #1831

Closed · 4 tasks done
res-life opened this issue Mar 5, 2024 · 8 comments


res-life commented Mar 5, 2024

Is your feature request related to a problem? Please describe.
Implement JSON generator to print JSON items

Additional context
Epic issue: #1823

tasks in NVIDIA/spark-rapids#10218

  • item1
  • item2
  • item3 number normalization
  • item4

SurajAralihalli commented Mar 7, 2024

In cudf's get_json_object, writing the output to the result column happens in two phases. In the first phase, the size of the output for each row (one thread per row) is computed: if (!out_buf.has_value()) { d_sizes[tid] = output_size; } (json_path.cu#L952)

Consider a JSON column with 2 rows: { "ab" : "pqr" }, { "ab" : "lmn" }

get_json_object(col, "$.ab") returns a cudf column built with make_strings_column (json_path.cu#L1046).

col.size() -> 2
offsets -> 0,3,6
chars.release() -> pqrlmn
nullcount -> 0
validity -> 1,1

In the second phase, the offsets are used by each thread to write to the correct location in device memory (json_path.cu#L938C7):

  1. Memory before: xxxxxx
     thread 1 writes at offset 0 -> pqrxxx
     thread 2 writes at offset 3 -> xxxlmn
  2. Memory after: pqrlmn
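The two-phase scheme above can be sketched on the host like this (a plain C++ illustration, not actual cudf code; the function name two_phase_write is hypothetical). Phase 1 computes per-row sizes, an exclusive scan turns them into offsets, and phase 2 lets each "thread" write its row at a disjoint location:

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical host-side sketch of cudf's two-phase write: phase 1 computes
// per-row output sizes (d_sizes[tid] = output_size), phase 2 uses the
// exclusive-scan offsets so each row is written at a non-overlapping location.
std::vector<char> two_phase_write(std::vector<std::string> const& row_outputs,
                                  std::vector<int>& offsets_out) {
  // Phase 1: size computation, one entry per row.
  std::vector<int> sizes;
  for (auto const& r : row_outputs) sizes.push_back(static_cast<int>(r.size()));

  // Exclusive scan of sizes -> offsets (0, 3, 6 for "pqr", "lmn").
  offsets_out.assign(1, 0);
  for (int s : sizes) offsets_out.push_back(offsets_out.back() + s);

  // Phase 2: each "thread" writes its row at its own offset in the shared
  // chars buffer; the writes never overlap.
  std::vector<char> chars(offsets_out.back());
  for (std::size_t tid = 0; tid < row_outputs.size(); ++tid) {
    std::memcpy(chars.data() + offsets_out[tid], row_outputs[tid].data(),
                sizes[tid]);
  }
  return chars;
}
```

For the 2-row example this produces the offsets 0,3,6 and the chars buffer "pqrlmn" shown above.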

However, Spark uses temporary memory while writing JSON (jsonExpressions.scala#L278):

        // temporarily buffer child matches, the emitted json will need to be
        // modified slightly if there is only a single element written
        val buffer = new StringWriter()

        var dirty = 0
        Utils.tryWithResource(jsonFactory.createGenerator(buffer)) { flattenGenerator =>
          flattenGenerator.writeStartArray()

          while (p.nextToken() != END_ARRAY) {
            // track the number of array elements and only emit an outer array if
            // we've written more than one element, this matches Hive's behavior
            dirty += (if (evaluatePath(p, flattenGenerator, nextStyle, xs)) 1 else 0)
          }
          flattenGenerator.writeEndArray()
        }

Potential issues:

  1. The final parent output can be smaller than the child's buffered output, so we may have to allocate for the upper bound. This can leave holes in the device memory that need to be addressed.

  2. Since the child's buffer is carved out of the parent's memory, the copy back into the parent (g.writeRawValue(buf.toString)) can overlap with its source. This can corrupt data for nested queries.


res-life commented Mar 7, 2024

> This can lead to holes in the device memory which needs to be addressed.

We could pass the leading char of the child's memory as a parameter to evaluatePath:

  private def evaluatePath(
      p: JsonParser,
      g: JsonGenerator,
      style: WriteStyle,
      path: List[PathInstruction]): Boolean = { 

==>>

  private def evaluatePath(
      p: JsonParser,
      g_leading_char: Char,      // Add this parameter
      g: JsonGenerator,
      style: WriteStyle,
      path: List[PathInstruction]): Boolean = {


res-life commented Mar 12, 2024

The interfaces I have in mind are:
write_start_array();
write_end_array();
write_raw();       // invokes parser.copy_raw_text
write_raw_value(); // note: unlike write_raw, this writes a comma or colon separator first if needed.

    g.writeStartArray();
    g.writeRawValue("1");
    g.writeRawValue("2");
// produces: [1,2
// the ',' separator is added automatically

copy_current_structure(parser); // invokes parser.copy_text; note: I'll add this as soon as possible.
json_generator new_child_generator();
get_output_len();
get_output_start_position();
get_current_output_position();
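A minimal host-side sketch of this interface (plain C++ for illustration; the real implementation would write into preallocated device memory, and the class body here is an assumption, not the actual spark-rapids-jni code). The key behavior is write_raw_value(), which inserts the ',' separator before every item after the first:

```cpp
#include <cstddef>
#include <string>

// Hypothetical sketch of the proposed json_generator interface.
class json_generator {
 public:
  void write_start_array() {
    out_ += '[';
    item_count_ = 0;
  }
  void write_end_array() { out_ += ']'; }

  // Copies text verbatim; no separator handling.
  void write_raw(std::string const& s) { out_ += s; }

  // Writes a ',' separator if this is not the first item, then the value.
  void write_raw_value(std::string const& s) {
    if (item_count_ > 0) out_ += ',';
    out_ += s;
    ++item_count_;
  }

  std::string const& output() const { return out_; }
  std::size_t get_output_len() const { return out_.size(); }

 private:
  std::string out_;
  int item_count_ = 0;
};
```

With this sketch, write_start_array() followed by write_raw_value("1") and write_raw_value("2") yields "[1,2", matching the example above.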

Normalization:
0.001e-3 => 1.0E-6
......
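The normalization target follows Java's Double.toString formatting (Spark prints "0.001e-3" as "1.0E-6"). A toy sketch of just the scientific-notation case shown above (the function normalize_number is hypothetical and does not reproduce Java's full shortest-representation rules): parse the text as a double, format the mantissa, and strip the leading zero and '+' sign from the exponent.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Hypothetical sketch of number normalization for the example above only.
// Real Java Double.toString picks the shortest round-tripping representation;
// this toy version assumes a one-significant-digit mantissa.
std::string normalize_number(std::string const& text) {
  double v = std::strtod(text.c_str(), nullptr);
  char buf[64];
  std::snprintf(buf, sizeof(buf), "%.1E", v);  // e.g. "1.0E-06"
  std::string s(buf);

  // Strip leading zeros in the exponent ("E-06" -> "E-6").
  auto e = s.find('E');
  std::size_t digits = e + 1;
  if (digits < s.size() && (s[digits] == '+' || s[digits] == '-')) ++digits;
  while (digits + 1 < s.size() && s[digits] == '0') s.erase(digits, 1);

  // Java omits the '+' sign on positive exponents.
  if (e + 1 < s.size() && s[e + 1] == '+') s.erase(e + 1, 1);
  return s;
}
```

This reproduces the 0.001e-3 => 1.0E-6 example; a production version would need the full Double.toString semantics (including when to avoid scientific notation).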

@res-life

Utility: #1863


SurajAralihalli commented Mar 15, 2024

Refer to the discussion in PR #1865. In the most recent update, we concluded that the Parser will implement the copy_current_structure function to avoid redundantly maintaining context. We also determined that nested generators are not required. Hence, I'll close PR #1865; PR #1868 will address this issue instead. FYI @res-life

@res-life

@SurajAralihalli
You may update your PR to do the normalization.

Normalization:
  0.001e-3 => 1.0E-6

Refer to NVIDIA/spark-rapids#10218
The parser now handles strings properly, but does not yet normalize numbers.

@res-life

Related Spark-Rapids issues: NVIDIA/spark-rapids#10218.
There are 4 items in #10218; item1, item2, and item4 are finished.

@res-life

number normalization PR: #1897
