README.Rmd

---
output: github_document
---

```{r, echo = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "##",
  fig.path = "images/"
)
```
```{r echo=FALSE, results="hide", message=FALSE}
library("badger")
```

# readtext: Import and handling for plain and formatted text files

<!-- badges: start -->
[![CRAN Version](https://www.r-pkg.org/badges/version/readtext)](https://CRAN.R-project.org/package=readtext)
`r badge_devel("quanteda/readtext", "royalblue")`
[![Downloads](https://cranlogs.r-pkg.org/badges/readtext)](https://CRAN.R-project.org/package=readtext)
[![Total Downloads](https://cranlogs.r-pkg.org/badges/grand-total/readtext?color=orange)](https://CRAN.R-project.org/package=readtext)
[![R-CMD-check](https://github.com/quanteda/readtext/actions/workflows/R-CMD-check.yaml/badge.svg)](https://github.com/quanteda/readtext/actions/workflows/R-CMD-check.yaml)
[![Codecov test coverage](https://codecov.io/gh/quanteda/readtext/branch/master/graph/badge.svg)](https://app.codecov.io/gh/quanteda/readtext?branch=master)
<!-- badges: end -->


[1]: https://codecov.io/gh/quanteda/readtext/branch/master

An R package for reading text files in all their various formats, by Ken Benoit, Adam Obeng, Paul Nulty, Aki Matsuo, Kohei Watanabe, and Stefan Müller.

## Introduction

**readtext** is a one-function package that does exactly what it says on the tin: It reads files containing text, along with any associated document-level metadata, which we call "docvars", for document variables.  Plain text files do not have docvars, but other forms such as .csv, .tab, .xml, and .json files usually do.  

**readtext** accepts filemasks, so that you can specify a pattern to load multiple texts, and these texts can even be of multiple types.  **readtext** is smart enough to process them correctly, returning a data.frame with a primary field "text" containing a character vector of the texts, and additional columns of the data.frame as found in the document variables from the source files.

As encoding can also be a challenging issue for those reading in texts, we include functions for diagnosing encodings on a file-by-file basis, and allow you to specify vectorized input encodings to read in file types with individually set (and different) encodings.  (All encoding functions are handled by the **stringi** package.)

## How to Install


1.  From CRAN

    ```{r, eval = FALSE}
    install.packages("readtext")
    ```

2.  From GitHub, if you want the latest development version.

    ```{r, eval = FALSE}
    # devtools packaged required to install readtext from Github 
    remotes::install_github("quanteda/readtext") 
    ```

Linux note: There are a couple of dependencies that may not be available on linux systems. On Debian/Ubuntu try installing these packages by running these commands at the command line:

```{bash, eval = FALSE}
sudo apt-get install libpoppler-cpp-dev   # for antiword
```

## Demonstration: Reading one or more text files

**readtext** supports plain text files (.txt), data in some form of JavaScript Object Notation (.json), comma-or tab-separated values (.csv, .tab, .tsv), XML documents (.xml), as well as PDF, Microsoft Word formatted files and other document formats (.pdf, .doc, .docx, .odt, .rtf). **readtext** also handles multiple files and file types using for instance a "glob" expression, files from a URL or an archive file (.zip, .tar, .tar.gz, .tar.bz).

The file formats are determined automatically by the filename extensions.  If a file has no extension or is unknown, **readtext** will assume that it is plain text.  The following command, for instance, will load in all of the files from the subdirectory `txt/UDHR/`:

```{r}
library("readtext")
# get the data directory from readtext
DATA_DIR <- system.file("extdata/", package = "readtext")

# read in all files from a folder
readtext(paste0(DATA_DIR, "/txt/UDHR/*"))
```

For files that contain multiple documents, such as comma-separated-value documents, you will need to specify the column name containing the texts, using the `text_field` argument:

```{r}
# read in comma-separated values and specify text field
readtext(paste0(DATA_DIR, "/csv/inaugCorpus.csv"), text_field = "texts")
```

For a more complete demonstration, see the package [vignette](https://readtext.quanteda.io/articles/readtext_vignette.html).

## Inter-operability with other packages

### With **quanteda**

**readtext** was originally developed in early versions of the [**quanteda**](https://github.com/quanteda/quanteda) package for the quantitative analysis of textual data. Because **quanteda**'s corpus constructor recognizes the data.frame format returned by `readtext()`, it can construct a corpus directly from a readtext object, preserving all docvars and other meta-data.

```{r}
library("quanteda")
# read in comma-separated values with readtext
rt_csv <- readtext(paste0(DATA_DIR, "/csv/inaugCorpus.csv"), text_field = "texts")
# create quanteda corpus
corpus_csv <- corpus(rt_csv)
summary(corpus_csv, 5)
```

### Text Interchange Format compatibility

**readtext** returns a data.frame that is formatted as per the corpus structure of the [Text Interchange Format](https://github.com/ropenscilabs/tif), it can easily be used by other packages that can accept a corpus in data.frame format.  

If you only want a named `character` object, **readtext** also defines an `as.character()` method that inputs its data.frame and returns just the named character vector of texts, conforming to the TIF definition of the character version of a corpus.