2016-11-16

Study & Analysis

  • Principles
    • “Someone unfamiliar with your project should be able to look at your computer files and understand in detail what you did and why.”
    • “Everything you do, you will have to do over and over again” – Murphy’s law
  • Structuring data for analysis
    • Poor organizational choices lead to significantly slower research progress.
    • It is critical to make results reproducible.

A reproducibility crisis

Organisation - So far, so good…

Now what?

I guess this is alright…

Which one is the most recent?

Another (bad) common approach

A possible solution

Still missing something

  • Need context => document metadata
    • How was the data generated?
    • From what was the data generated?
    • What were the (experimental) conditions?
    • Etc, etc, …
  • Use research domain standards
  • Use a structured format
    • machine-readable if possible
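
One way to keep such context machine-readable is a small JSON "sidecar" file stored next to each data file. The sketch below is only an illustration: the function name `write_metadata`, the `.meta.json` suffix, and every field in `example_metadata` are assumptions, not a domain standard. Where your research field has a metadata standard, use its field names instead.

```python
import json
from pathlib import Path


def write_metadata(data_file, metadata):
    """Write metadata as JSON next to data_file and return the sidecar path."""
    data_file = Path(data_file)
    # Hypothetical naming convention: counts.csv -> counts.csv.meta.json
    sidecar = data_file.with_name(data_file.name + ".meta.json")
    with sidecar.open("w", encoding="utf-8") as handle:
        json.dump(metadata, handle, indent=2, sort_keys=True)
    return sidecar


# Hypothetical example values answering the questions in the list above.
example_metadata = {
    "generated_by": "instrument X, protocol v2",   # how was the data generated?
    "source_material": "sample S1",                # from what was it generated?
    "conditions": {"temperature_celsius": 37},     # experimental conditions
    "date": "2016-11-16",
}
```

Calling `write_metadata("raw/counts.csv", example_metadata)` would create `raw/counts.csv.meta.json`, which both people and scripts can read.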

Some suggested best-practices

  • Raw data lives in its own folder and is never altered or mixed with data produced by manual or programmatic manipulation. That is, derived data is kept separate from raw data, and raw data is not duplicated.
  • Code is kept separate from data.
  • There is a scratch directory for experimentation. Everything in the scratch directory can be deleted at any time without negative impact.
  • There should be a README in every directory, describing the purpose of the directory and its contents.
  • Manuscript production output is kept separate from everything else.
  • Use non-proprietary formats – .csv rather than .xlsx
  • Etc…
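
The rule that raw data is never altered can be checked rather than just trusted. A minimal sketch, assuming a checksum manifest is an acceptable safeguard for your project: record a SHA-256 checksum per raw file once, then compare later. The function names here are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(raw_dir):
    """Map each file under raw_dir (by relative path) to its checksum."""
    raw_dir = Path(raw_dir)
    return {
        path.relative_to(raw_dir).as_posix(): sha256_of(path)
        for path in sorted(raw_dir.rglob("*"))
        if path.is_file()
    }


def changed_files(raw_dir, manifest):
    """Return relative paths that were modified, added, or removed since the manifest was built."""
    current = build_manifest(raw_dir)
    return sorted(
        name
        for name in set(current) | set(manifest)
        if current.get(name) != manifest.get(name)
    )
```

An empty list from `changed_files` means the raw folder still matches the recorded manifest. The manifest itself should be stored outside the raw folder, for example in a project-wide documentation folder.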

Non-proprietary formats

  • A text-based format is more future-proof than a proprietary binary format from a commercial vendor
  • Never, never, never use Excel for science!
  • Markdown is a convenient way to get well-formatted output from plain text.
    • Simple & readable formatting
    • Can be converted to lots of different outputs
      • HTML, PDF, MS Word, slides, etc.
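
To illustrate why a text-based format is easy to keep working with, here is a minimal sketch that writes and reads a small table as CSV using only the Python standard library. The helper names `write_table` and `read_table` are hypothetical.

```python
import csv


def write_table(path, header, rows):
    """Write a header and rows to a plain-text CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(header)
        writer.writerows(rows)


def read_table(path):
    """Read a CSV file back as a list of dicts keyed by the header row."""
    with open(path, newline="", encoding="utf-8") as handle:
        return list(csv.DictReader(handle))
```

Note that CSV stores no type information: every value read back is a string, so numeric columns must be converted explicitly, and the units and types are worth recording in the accompanying metadata.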

Directory structure

Or like this…

├── bin  <---------------# Binary files and executables (jar files & proj-wide scripts etc)
├── conf <---------------# Project-wide configuration
├── doc  <---------------# Any documents, such as manuscripts being written
├── experiments  <-------# The main experiments folder
│   ├── 2000-01-01-exa <-# An example Experiment
│   │   ├── audit  <-----# Audit logs from workflow runs (higher level than normal logs)
│   │   ├── bin   <------# Experiment-specific executables and scripts
│   │   ├── conf  <------# Experiment-specific config
│   │   ├── data  <------# Any data generated by workflows
│   │   ├── doc   <------# Experiment-specific documents
│   │   ├── log   <------# Log files from workflow runs (lower level than audit logs)
│   │   ├── raw   <------# Raw-data to be used in the experiment (not to be changed)
│   │   ├── results  <---# Results from workflow runs
│   │   ├── run   <------# All files rel. to running experiment: Workflows, run confs/scripts...
│   │   └── tmp   <------# Any temporary files not supposed to be saved
├── raw  <---------------# Project-wide raw data
├── results  <-----------# Project-wide results
└── src  <---------------# Project-wide source code (that needs to be compiled)

_From Samuel Lampa’s blog: http://bionics.it/posts/organizing-compbio-projects_
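
A layout like the one above is easier to apply consistently when it is created by a script rather than by hand. The following is a minimal sketch, not part of the original blog post: the directory names are taken from the tree above, while the function names and the `date-label` naming of experiment folders are assumptions based on the `2000-01-01-exa` example.

```python
from pathlib import Path

# Directory names copied from the tree above.
PROJECT_DIRS = ["bin", "conf", "doc", "experiments", "raw", "results", "src"]
EXPERIMENT_DIRS = [
    "audit", "bin", "conf", "data", "doc",
    "log", "raw", "results", "run", "tmp",
]


def create_project(root):
    """Create the project-wide directories under root and return root as a Path."""
    root = Path(root)
    for name in PROJECT_DIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    return root


def create_experiment(root, date, label):
    """Create experiments/<date>-<label> with its subdirectories and return its path."""
    experiment = Path(root) / "experiments" / f"{date}-{label}"
    for name in EXPERIMENT_DIRS:
        (experiment / name).mkdir(parents=True, exist_ok=True)
    return experiment
```

For example, `create_project("myproject")` followed by `create_experiment("myproject", "2000-01-01", "exa")` reproduces the tree shown above. Both functions are safe to re-run because existing directories are left untouched.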

Project organization

Source acknowledgements

Hands-on session