diff --git a/.DS_Store b/.DS_Store
index 16e3fba..da861da 100644
Binary files a/.DS_Store and b/.DS_Store differ
diff --git a/index.html b/index.html
index af2dd49..8722476 100644
--- a/index.html
+++ b/index.html
@@ -333,6 +333,28 @@

Blogs

+
+
+
+
+ + + +
+
+

TxT360: A Top-Quality LLM Pre-training Dataset Requires the Perfect Blend

+ +

We introduce TxT360 (Trillion eXtracted Text), the first dataset to globally deduplicate 99 CommonCrawl snapshots and 14 high-quality data sources from diverse domains (e.g., FreeLaw and PG-19). The large-scale deduplication process and the rich stored metadata enable precise control over the data distribution.

+ +
+
+
+