Major Updates

🧪 Users can now run Ray-based distributed data processing by following the newly added docs. #523
🧪 The DJ-Cookbook has gathered numerous high-quality data processing recipes from various vertical fields, and the related documents have been updated on the homepage. #542
💥 Switch Ray deduplication from Task mode to Actor mode, so users can run these operators without installing Redis. #526
🚀 Append a summary of warnings and errors to the log at the end of each run, so they remain easy to spot under the sample-level fault tolerance mechanism. #534
🚀 Automatically update relevant documents when adding OPs to reduce the development burden. #527
🛝 Add usability tags for OPs:
- alpha tag for OPs in which only the basic OP implementations are finished;
- beta tag for OPs in which unit tests are added based on the alpha version;
- stable tag for OPs in which OP optimizations related to DJ (e.g. model management, batched processing, OP fusion, ...) are added based on the beta version.
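
The end-of-run warning and error summary mentioned above can be pictured with a minimal sketch. It uses Python's standard `logging` module purely for illustration; the `SummaryHandler` class and the `dj_sketch` logger name are hypothetical and are not taken from the Data-Juicer code base, whose actual logger and summary format may differ.

```python
import logging
from collections import Counter


class SummaryHandler(logging.Handler):
    """Hypothetical handler: keep WARNING-and-above records for a final summary."""

    def __init__(self):
        super().__init__(level=logging.WARNING)
        self.records = []

    def emit(self, record):
        # Store the record instead of printing it immediately.
        self.records.append(record)

    def summary(self):
        # Count records per level, then list each collected message.
        counts = Counter(r.levelname for r in self.records)
        lines = [f"{counts.get('WARNING', 0)} warning(s), {counts.get('ERROR', 0)} error(s)"]
        lines += [f"[{r.levelname}] {r.getMessage()}" for r in self.records]
        return "\n".join(lines)


logger = logging.getLogger("dj_sketch")  # illustrative logger name
logger.setLevel(logging.INFO)
logger.propagate = False
handler = SummaryHandler()
logger.addHandler(handler)

# Simulated run: failures are logged and processing continues.
logger.info("processing started")
logger.warning("sample 3 skipped: empty text")
logger.error("sample 7 failed: decode error")
logger.info("processing finished")

# At the end of the run, emit everything that would otherwise be buried in the log.
report = handler.summary()
print(report)
```

Collecting the records in a handler keeps the summary independent of where the individual messages were emitted, which is the point when failed samples are skipped rather than aborting the run.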

New OPs

image_segment_mapper: Perform segment-anything on images and return the bounding boxes. #550
mllm_mapper: Use MLLMs to generate text for images. #550
sdxl_prompt2prompt_mapper: Use the generative model SDXL and image editing technique Prompt-to-Prompt to generate pairs of similar images. #550
sentence_augmentation_mapper: Augment sentences using LLMs. #550
text_pair_similarity_filter: Filter samples according to the similarity score between the text pair. #550

Add a global skip_op_error parameter that enables fault tolerance when running the DataJuicer analyzer and executor; fault tolerance stays disabled for unit tests. #528
Fix model force download bug. #529
Fix an IndexError raised when saving a dataset to disk if the result dataset has fewer samples than there are workers. #536
Fix the missing field meta tag in Ray mode. #538
Update max_tokens or max_new_tokens for vllm-based OPs to avoid overly short generations. #544
Fix a bug in the role-playing data generation demo. #545
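
The idea behind a global skip_op_error switch can be sketched as follows. This is only an illustration under stated assumptions: `run_op` and `text_length` are hypothetical helpers, and the sketch drops a failing sample and records the error, which may not match how Data-Juicer actually handles failed samples.

```python
def run_op(op, samples, skip_op_error=True):
    """Apply `op` to each sample; on failure either skip the sample or re-raise."""
    results, errors = [], []
    for idx, sample in enumerate(samples):
        try:
            results.append(op(sample))
        except Exception as exc:
            if not skip_op_error:
                # Strict mode (e.g. unit tests): surface the failure immediately.
                raise
            # Fault-tolerant mode: remember the error and keep going.
            errors.append((idx, repr(exc)))
    return results, errors


def text_length(sample):
    # Hypothetical OP: fails with KeyError when the "text" field is missing.
    return {"text": sample["text"], "length": len(sample["text"])}


samples = [{"text": "hello"}, {"txt": "bad key"}, {"text": "ok"}]
results, errors = run_op(text_length, samples, skip_op_error=True)
print(results)
print(errors)
```

Calling `run_op(text_length, samples, skip_op_error=False)` raises the `KeyError` instead, which is the behavior one would want in unit tests so that broken OPs are not silently tolerated.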

Enhance unit tests for API-calling OPs. #528
Remove sandbox requirements installation from Dockerfile. #530
Update the datasource related APIs to be compatible with the latest version of Ray. #532
Limit the number of QA pairs generated for each text in generate_qa_from_text_mapper. #541
Update docs for preparing DJ2.0 release. #542
Update the architecture figure to use a faster CDN link. #543
Add a video demo for role-playing data generation. #545
Optimize OP docs for global text search. #552
Use a translator that is more stable and faster than Google Translate for automatic OP doc building. #554