Commit

add txt360 to blog section
caris-mu committed Oct 10, 2024
1 parent 6cffb3c commit 961ff41
Showing 2 changed files with 22 additions and 0 deletions.
Binary file modified .DS_Store
Binary file not shown.
22 changes: 22 additions & 0 deletions index.html
@@ -333,6 +333,28 @@ <h1>Blogs</h1>
<!-- <li> <a class="tag" data-tag="software">software</a></li>-->
<!-- <li> <a class="tag" data-tag="research">research</a></li>-->
<!-- </ul>-->
<div class="blog-posts">
<section data-tags="announcement, dataset" class="blog-post">
<div class="row">
<div class="col-5">
<span class="image main">
<img src="images/txt360_title.png" alt="TxT360 title image">
</span>
</div>
<div class="col-7">
<h3>TxT360: A Top-Quality LLM Pre-training Dataset Requires the Perfect Blend</h3>
<ul class="tags">
<li> <a class="tag" data-tag="announcement">announcement</a></li>
<li> <a class="tag" data-tag="dataset">dataset</a></li>
</ul>
<p>We introduce TxT360 (Trillion eXtracted Text), the first dataset to globally deduplicate 99 CommonCrawl snapshots and 14 high-quality data sources from diverse domains (e.g., FreeLaw and PG-19). The large-scale deduplication process and the rich metadata stored with the data enable precise control over the data distribution.</p>
<ul class="actions">
<li><a href="https://huggingface.co/spaces/LLM360/TxT360" class="button" target="_blank">Learn more</a></li>
</ul>
</div>
</div>
</section>
<br>
<div class="blog-posts">
<section data-tags="Maitrix, announcement, benchmark" class="blog-post">
<div class="row">