
[FEA] GetJsonObject: Implement JSON generator to print JSON items #1831

Closed · 4 tasks done
res-life opened this issue Mar 5, 2024 · 8 comments


res-life commented Mar 5, 2024

Is your feature request related to a problem? Please describe.
Implement JSON generator to print JSON items

Additional context
Epic issue: #1823

tasks in NVIDIA/spark-rapids#10218

  • item1
  • item2
  • item3 number normalization
  • item4

SurajAralihalli commented Mar 7, 2024

In cudf's get_json_object, writing the output to the result column happens in two phases. In the first phase, the size of the output for each row (one thread per row) is computed: if (!out_buf.has_value()) { d_sizes[tid] = output_size; } (json_path.cu#L952)

Consider a JSON column with 2 rows: { "ab" : "pqr" }, { "ab" : "lmn" }

get_json_object(col, "$.ab") returns a cudf column built with make_strings_column (json_path.cu#L1046).

col.size() -> 2
offsets -> 0,3,6
chars.release() -> pqrlmn
nullcount -> 0
validity -> 1,1

In the second phase, the offsets are used by each thread to write to the correct location in device memory (json_path.cu#L938C7):

  1. Memory before: xxxxxx
     thread 1 writes at offset 0 -> pqrxxx
     thread 2 writes at offset 3 -> xxxlmn
  2. Memory after: pqrlmn
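The two-phase scheme above can be sketched on the host like this (a plain C++ illustration, not actual cudf code; the function name two_phase_write is hypothetical). Phase 1 computes per-row sizes, an exclusive scan turns them into offsets, and phase 2 lets each "thread" write its row at a disjoint location:

```cpp
#include <cstring>
#include <string>
#include <vector>

// Hypothetical host-side sketch of cudf's two-phase write: phase 1 computes
// per-row output sizes (d_sizes[tid] = output_size), phase 2 uses the
// exclusive-scan offsets so each row is written at a non-overlapping location.
std::vector<char> two_phase_write(std::vector<std::string> const& row_outputs,
                                  std::vector<int>& offsets_out) {
  // Phase 1: size computation, one entry per row.
  std::vector<int> sizes;
  for (auto const& r : row_outputs) sizes.push_back(static_cast<int>(r.size()));

  // Exclusive scan of sizes -> offsets (0, 3, 6 for "pqr", "lmn").
  offsets_out.assign(1, 0);
  for (int s : sizes) offsets_out.push_back(offsets_out.back() + s);

  // Phase 2: each "thread" writes its row at its own offset in the shared
  // chars buffer; the writes never overlap.
  std::vector<char> chars(offsets_out.back());
  for (std::size_t tid = 0; tid < row_outputs.size(); ++tid) {
    std::memcpy(chars.data() + offsets_out[tid], row_outputs[tid].data(),
                sizes[tid]);
  }
  return chars;
}
```

For the 2-row example this produces the offsets 0,3,6 and the chars buffer "pqrlmn" shown above.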

However, Spark uses temporary memory while writing JSON (jsonExpressions.scala#L278):

        // temporarily buffer child matches, the emitted json will need to be
        // modified slightly if there is only a single element written
        val buffer = new StringWriter()

        var dirty = 0
        Utils.tryWithResource(jsonFactory.createGenerator(buffer)) { flattenGenerator =>
          flattenGenerator.writeStartArray()

          while (p.nextToken() != END_ARRAY) {
            // track the number of array elements and only emit an outer array if
            // we've written more than one element, this matches Hive's behavior
            dirty += (if (evaluatePath(p, flattenGenerator, nextStyle, xs)) 1 else 0)
          }
          flattenGenerator.writeEndArray()
        }

Potential issues:

  1. The final parent output can be smaller than the child's buffered output, so we may have to allocate for the upper bound. This can leave holes in the device memory that need to be addressed.

  2. Since the child's buffer is carved out of the parent's memory, the copy back into the parent (g.writeRawValue(buf.toString)) can overlap with its source. This can corrupt data for nested queries.


res-life commented Mar 7, 2024

> This can lead to holes in the device memory which needs to be addressed.

We could pass the leading char of the child's memory as a parameter to evaluatePath:

  private def evaluatePath(
      p: JsonParser,
      g: JsonGenerator,
      style: WriteStyle,
      path: List[PathInstruction]): Boolean = { 

==>>

  private def evaluatePath(
      p: JsonParser,
      g_leading_char: Char,      // Add this parameter
      g: JsonGenerator,
      style: WriteStyle,
      path: List[PathInstruction]): Boolean = {


res-life commented Mar 12, 2024

The interfaces I have in mind are:
write_start_array();
write_end_array();
write_raw();       // invokes parser.copy_raw_text
write_raw_value(); // note: unlike write_raw, this writes a comma or colon separator first if needed.

    g.writeStartArray();
    g.writeRawValue("1");
    g.writeRawValue("2");
// produces: [1,2
// the ',' separator is added automatically

copy_current_structure(parser); // invokes parser.copy_text; note: I'll add this as soon as possible.
json_generator new_child_generator();
get_output_len();
get_output_start_position();
get_current_output_position();
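A minimal host-side sketch of this interface (plain C++ for illustration; the real implementation would write into preallocated device memory, and the class body here is an assumption, not the actual spark-rapids-jni code). The key behavior is write_raw_value(), which inserts the ',' separator before every item after the first:

```cpp
#include <cstddef>
#include <string>

// Hypothetical sketch of the proposed json_generator interface.
class json_generator {
 public:
  void write_start_array() {
    out_ += '[';
    item_count_ = 0;
  }
  void write_end_array() { out_ += ']'; }

  // Copies text verbatim; no separator handling.
  void write_raw(std::string const& s) { out_ += s; }

  // Writes a ',' separator if this is not the first item, then the value.
  void write_raw_value(std::string const& s) {
    if (item_count_ > 0) out_ += ',';
    out_ += s;
    ++item_count_;
  }

  std::string const& output() const { return out_; }
  std::size_t get_output_len() const { return out_.size(); }

 private:
  std::string out_;
  int item_count_ = 0;
};
```

With this sketch, write_start_array() followed by write_raw_value("1") and write_raw_value("2") yields "[1,2", matching the example above.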

Normalization:
0.001e-3 => 1.0E-6
......
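The normalization target follows Java's Double.toString formatting (Spark prints "0.001e-3" as "1.0E-6"). A toy sketch of just the scientific-notation case shown above (the function normalize_number is hypothetical and does not reproduce Java's full shortest-representation rules): parse the text as a double, format the mantissa, and strip the leading zero and '+' sign from the exponent.

```cpp
#include <cstdio>
#include <cstdlib>
#include <string>

// Hypothetical sketch of number normalization for the example above only.
// Real Java Double.toString picks the shortest round-tripping representation;
// this toy version assumes a one-significant-digit mantissa.
std::string normalize_number(std::string const& text) {
  double v = std::strtod(text.c_str(), nullptr);
  char buf[64];
  std::snprintf(buf, sizeof(buf), "%.1E", v);  // e.g. "1.0E-06"
  std::string s(buf);

  // Strip leading zeros in the exponent ("E-06" -> "E-6").
  auto e = s.find('E');
  std::size_t digits = e + 1;
  if (digits < s.size() && (s[digits] == '+' || s[digits] == '-')) ++digits;
  while (digits + 1 < s.size() && s[digits] == '0') s.erase(digits, 1);

  // Java omits the '+' sign on positive exponents.
  if (e + 1 < s.size() && s[e + 1] == '+') s.erase(e + 1, 1);
  return s;
}
```

This reproduces the 0.001e-3 => 1.0E-6 example; a production version would need the full Double.toString semantics (including when to avoid scientific notation).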

@res-life

Utility: #1863


SurajAralihalli commented Mar 15, 2024

Refer to the discussion in PR #1865. In the most recent update, we concluded that the Parser will implement the copy_current_structure function to avoid redundantly maintaining context. We also determined that nested generators are not required. Hence, I'll close PR #1865; PR #1868 will address this issue instead. FYI @res-life

@res-life

@SurajAralihalli
You may update your PR to do the normalization.

Normalization:
  0.001e-3 => 1.0E-6

Refer to NVIDIA/spark-rapids#10218
The parser now handles strings properly, but does not yet normalize numbers.

@res-life

Related Spark-Rapids issues: NVIDIA/spark-rapids#10218.
There are 4 items in #10218; item1, item2, and item4 are finished.

@res-life

number normalization PR: #1897
