fastq-elasticsearch

Collects metadata about FASTQ files and stores it in Elasticsearch.

Build

mvn clean install

This generates the bundle that contains all dependencies.

Start

The fastq-elastic.sh script can be used to start the app from a console.

Configuration

The meta information about the sample files is stored as JSON documents. Before the fastq-elastic tool is started, the mapping has to be created in Elasticsearch. The mapping configuration can be found in the git repo (src/main/resources/sampledb-index.json).
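
As a rough sketch, the mapping could be uploaded from a script instead of by hand. The index name `sampledb` is taken from the queries in the cheat sheet; the helper name `build_mapping_request` is hypothetical, and the code assumes that sampledb-index.json already holds the complete index-creation body, which may not match the actual file.

```python
import json
import urllib.request


def build_mapping_request(mapping_path, host="localhost", port=9200, index="sampledb"):
    """Build (but do not send) a PUT request that creates the index.

    Assumption: the JSON file is a complete index-creation body.
    """
    with open(mapping_path, "r", encoding="utf-8") as handle:
        mapping = json.load(handle)  # fails early if the file is not valid JSON
    return urllib.request.Request(
        url=f"http://{host}:{port}/{index}",
        data=json.dumps(mapping).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
```

Passing the returned request to `urllib.request.urlopen` would send it to the cluster; building and sending are kept separate so the request can be inspected first.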

The next step is to configure the fastq-elastic tool: set your custom values in the sample.conf file.

{
    elastic.host = localhost
    elastic.port = 9200

    # Supported file types
    file.extensions = [fastq.gz]

    # List of folders that should be parsed
    folders.root = [
        /sample/folder1,
        /sample/folder2
    ]

    # List of ignored folders
    folders.exclusive = []
}
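
To make the meaning of these settings concrete, here is a minimal sketch of how a scan driven by them might behave. It is not the tool's actual implementation: the function name `find_sample_files` is hypothetical, and it assumes that `folders.exclusive` lists directories to skip (as the comment in the config suggests) and that `file.extensions` is matched against the end of the file name.

```python
import os


def find_sample_files(roots, extensions, excluded=()):
    """Yield paths under `roots` whose names end with one of `extensions`,
    skipping every directory listed in `excluded`."""
    excluded = {os.path.abspath(path) for path in excluded}
    suffixes = tuple("." + ext.lstrip(".") for ext in extensions)
    for root in roots:
        for current, dirs, files in os.walk(os.path.abspath(root)):
            # Prune excluded directories in place so os.walk never descends into them.
            dirs[:] = [d for d in dirs if os.path.join(current, d) not in excluded]
            for name in sorted(files):
                if name.endswith(suffixes):
                    yield os.path.join(current, name)
```

With the values above, the call would look like `find_sample_files(["/sample/folder1", "/sample/folder2"], ["fastq.gz"], [])`.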

Cheat sheet

The most interesting part of the fastq-elastic service is what data can be retrieved from Elasticsearch and how. The following section shows some queries that can be run from the Kibana console.

A more general cheat sheet is available at http://elasticsearch-cheatsheet.jolicode.com/.
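
The same queries can also be sent from a script rather than from Kibana. The sketch below uses only the Python standard library; the helper names `build_query_request` and `run_query` are hypothetical, and host and port default to the values from sample.conf. The request paths in this cheat sheet include the `_doc` type segment, which newer Elasticsearch versions may no longer accept, so the path is left as a parameter.

```python
import json
import urllib.request


def build_query_request(path, body, host="localhost", port=9200):
    """Build (but do not send) a request carrying a JSON query body."""
    return urllib.request.Request(
        url=f"http://{host}:{port}/{path.lstrip('/')}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",  # _search and _count accept POST as well as GET
    )


def run_query(path, body, **kwargs):
    """Send the request and return the decoded JSON response."""
    request = build_query_request(path, body, **kwargs)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

For example, `run_query("sampledb/_doc/_count", {"query": {"wildcard": {"sample.samplePath": "*"}}})` would send the count query shown in this cheat sheet.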

Count the number of samples

GET sampledb/_doc/_count
{
  "query": {
    "wildcard": {
      "sample.samplePath": "*"
    }
  }
}

Get sample files whose name starts with 'XXX-KM-34_S34'

GET sampledb/_doc/_search
{
  "query": {
    "wildcard": {
      "sample.sampleName.exact": "XXX-KM-34_S34*"
    }
  }
}

Get all sample files whose name contains 'XXX5S' and whose file length is at least 30 MB

GET sampledb/_doc/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "wildcard": {
            "sample.sampleName.exact": "*XXX5S*"
          }
        }
      ],
      "filter": [
        {
          "range": {
            "sample.fileLength": {
              "gte": "30000000"
            }
          }
        }
      ]
    }
  }
}
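
The threshold above is 30 MB counted as 30 × 1,000,000 bytes, while other queries in this cheat sheet divide by 1048576. The following sketch builds the same query body from a size in MB and makes that choice explicit. The function name `name_and_size_query` is hypothetical, and `sample.fileLength` is assumed to hold the size in bytes; the threshold is sent as a number here rather than as a string.

```python
def name_and_size_query(name_fragment, min_size_mb, bytes_per_mb=1_000_000):
    """Build a bool query: the name contains `name_fragment` and the file
    is at least `min_size_mb` megabytes long."""
    min_bytes = int(min_size_mb * bytes_per_mb)
    return {
        "query": {
            "bool": {
                "must": [
                    {"wildcard": {"sample.sampleName.exact": f"*{name_fragment}*"}}
                ],
                "filter": [
                    {"range": {"sample.fileLength": {"gte": min_bytes}}}
                ],
            }
        }
    }
```

`name_and_size_query("XXX5S", 30)` reproduces the threshold of 30000000 used above; passing `bytes_per_mb=1048576` gives 31457280 instead.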

Get the top 20 duplicated sample files

GET sampledb/_doc/_search
{
  "size": 0,
  "aggs": {
    "distinct_sample": {
      "terms": {
        "field": "sample.sampleName.exact",
        "size": 20
      }
    }
  }
}
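
A terms aggregation returns the most frequent sample names together with their document counts, so names that occur only once can still appear in the result. The sketch below keeps only the real duplicates. The function name `duplicated_samples` is hypothetical, and the response shown in the usage note is made up for illustration.

```python
def duplicated_samples(response, aggregation="distinct_sample"):
    """Return (sample name, count) pairs for names that occur more than once."""
    buckets = response["aggregations"][aggregation]["buckets"]
    return [
        (bucket["key"], bucket["doc_count"])
        for bucket in buckets
        if bucket["doc_count"] > 1
    ]
```

Given an illustrative response such as `{"aggregations": {"distinct_sample": {"buckets": [{"key": "sample-A", "doc_count": 3}, {"key": "sample-B", "doc_count": 1}]}}}`, only `sample-A` is reported as duplicated.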

Find the largest sample file in MB using aggregation (in 2 steps)

POST sampledb/_doc/_search
{
  "size": 0,
  "aggs": {
    "largest_sample": {
      "max": {
        "field": "sample.fileLength",
        "script": {
          "source": "_value / params.in_mb",
          "params": {
            "in_mb": 1048576
          }
        }
      }
    }
  }
}

The aggregation reports the size in MB. Convert the value back to bytes (multiply by 1048576) and look up the file with that exact length; the number in the query below is an example:

GET sampledb/_doc/_search
{
  "query": {
    "match": {
      "sample.fileLength": 58362878472
    }
  }
}
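
The two steps can be tied together with a small conversion. This sketch assumes that the first response reports the maximum in MB at full double precision (under `aggregations.largest_sample.value`) and that `sample.fileLength` holds bytes; the function names are hypothetical.

```python
BYTES_PER_MB = 1048576  # same divisor as params.in_mb in the aggregation script


def mb_to_bytes(size_in_mb):
    """Convert the MB value reported by the aggregation back to whole bytes."""
    return round(size_in_mb * BYTES_PER_MB)


def exact_length_query(size_in_mb):
    """Build the second-step query that matches the file by its length in bytes."""
    return {"query": {"match": {"sample.fileLength": mb_to_bytes(size_in_mb)}}}
```

Because 1048576 is a power of two, dividing a byte count of this magnitude by it and multiplying back loses no precision in double arithmetic, so `mb_to_bytes(58362878472 / 1048576)` recovers 58362878472. If the response value has been rounded or truncated, the lookup may miss the file.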

Find the top 3 largest sample files using a query with sorting (in 1 step)

GET sampledb/_doc/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    {
      "_script": {
        "type": "number",
        "script": {
          "lang": "painless",
          "source": "doc['sample.fileLength'].value / params.in_mb",
          "params": {
            "in_mb": 1048576
          }
        },
        "order": "desc"
      }
    }
  ],
  "size": 3
}

Get the total size of the sample files in GB

GET sampledb/_doc/_search
{
  "size": 0,
  "aggs": {
    "total_size": {
      "sum": {
        "field": "sample.fileLength",
        "script": {
          "source": "_value / params.in_gb",
          "params": {
            "in_gb": 1073741824
          }
        }
      }
    }
  }
}
