Posts

Showing posts from 2017

Update on the DEE project Dec 2017

Back in 2015, our group described DEE, a user-friendly repository of uniformly processed RNA-seq data, which I covered in detail in a previous post. Ours was the first such repository that wasn't limited to human or mouse and included sequencing data from a variety of instruments and library types. The purpose of this post is to reflect on the mixed success of DEE and outline where this project is going in future.

Overall I've received a lot of positive feedback from users and a number of citations to our poster. Thanks to everyone who used it and sent suggestions, comments and bug reports! However, our attempt to have the repository published wasn't so successful, due to reviewer niggles over points that I consider minor but that are hard to implement quickly. The main points raised by reviewers were:

Is it reasonable to treat all data sets as if they were single end? For this one, the reviewers were split, one said it was OK and the other was adamant that it was unacceptable despite my …

Diagnosing PCR duplicates from cluster duplicates

NovaSeq, HiSeqX and HiSeq4000 Illumina sequencers have patterned flowcells, which use a different clustering chemistry to the randomly clustered flowcell systems (HiSeq2500 & MiSeq). This chemistry is known to cause duplicates during the clustering process. For some background on the issue, see these previous blog posts:

QC Fail blog by Steve Wingett
Enseqlopedia blog by James Hadfield

In my recent whole genome bisulfite sequencing experiment using TruSeq methylation library prep kits and NovaSeq, I noticed a high proportion of duplicate reads and wanted to investigate whether these were "cluster" duplicates, i.e. generated during the clustering process due to ExAmp chemistry, or duplicates generated during the PCR step. Generally, cluster duplicates occur in close proximity to each other on the flowcell surface, whereas PCR duplicates are expected to be spread uniformly across the flowcell surface.
To diagnose this, I used the diagnose-dups tool by Dave Larson, which can be found on GitHub here. I wr…
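To make the proximity idea concrete, here is a minimal sketch of my own (it is not how diagnose-dups works internally). It assumes the standard Illumina read name layout of instrument:run:flowcell:lane:tile:x:y, and the read names and the dup_distance function name are made up for illustration. For two reads that map to the same position, it reports how far apart they sit on the flowcell, in the coordinate units used in the read name:

```shell
# dup_distance: hypothetical helper, takes two Illumina read names.
# Prints the distance between the two clusters if they share a flowcell,
# lane and tile, otherwise prints "different_tile".
dup_distance() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    split(a, p, ":"); split(b, q, ":")
    # fields: 1 instrument, 2 run, 3 flowcell, 4 lane, 5 tile, 6 x, 7 y
    if (p[3] != q[3] || p[4] != q[4] || p[5] != q[5]) {
      print "different_tile"
      exit
    }
    dx = p[6] - q[6]
    dy = p[7] - q[7]
    printf "%.1f\n", sqrt(dx * dx + dy * dy)
  }'
}

# Made-up read names: same tile, a few coordinate units apart
dup_distance "A00123:45:HXXXXXXXX:1:1101:1000:2000" "A00123:45:HXXXXXXXX:1:1101:1003:2004"
# Made-up read names: different tiles
dup_distance "A00123:45:HXXXXXXXX:1:1101:1000:2000" "A00123:45:HXXXXXXXX:1:2101:1003:2004"
```

Duplicate pairs with a small distance on the same tile would point towards cluster duplicates, while pairs scattered across tiles look more like PCR duplicates. Where to draw the line between "small" and "large" is a judgement call that I haven't tried to settle here.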

Considerations in performing whole human genome bisulfite sequencing on the Illumina NovaSeq system

Today at the NGS workshop at WEHI, Melbourne, I presented some findings related to a pilot study of 12 methylomes studied with whole genome bisulfite sequencing. Two of those libraries were also sequenced on the HiSeq4000 platform to similar depth, which revealed some subtle but interesting differences between the systems. What we found was that the actual sequence coverage obtained was substantially less than projected, due to two problems. Firstly, the insert size was too small, which looks like it could be due to the inner workings of the Illumina TruSeq methylation kit. Secondly, there was a high proportion of duplicate reads, that is, reads with the same strand and coordinates, which are likely not independent observations. I will need to look in further detail at whether these are PCR duplicates or "cluster" duplicates. Perhaps the library prep or clustering protocols need some tweaking for bisulfite sequencing.

So as promised, here is the link to the slides.

Upset plots as a replacement to Venn Diagram

I previously posted about different ways to obtain Venn diagrams, but what if you have more than 4 lists to intersect? These plots become messy and hard to read. One alternative which has become popular is the UpSet plot. There is an excellent summary of the philosophy behind this approach in this article and academic paper here. An example plot is below:



In this post, I'll describe how to get from lists of genes in text files to an UpSet plot using R. As with most R packages, you'll find that loading the data is the hardest part, and that data import is the least documented aspect.

First I'll generate some random gene lists using a quick and dirty shell script. My complete list contains 58302 genes and looks like this:
$ head -5 Homo_sapiens.GRCh38.90.gnames.txt
ENSG00000000003_TSPAN6
ENSG00000000005_TNMD
ENSG00000000419_DPM1
ENSG00000000457_SCYL3
ENSG00000000460_C1orf112
This is the script which generates random subsets of genes with the suffix &quo…
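As a rough illustration of the general idea only (this is not the script from the post), random subsets can be drawn from a gene list with shuf from GNU coreutils. The file names, the number of lists and the subset size below are all assumptions chosen for the demo, and a five-gene stand-in file is used in place of the full list:

```shell
# Stand-in for Homo_sapiens.GRCh38.90.gnames.txt (only 5 genes, for the demo)
GENES=$(mktemp)
printf '%s\n' ENSG00000000003_TSPAN6 ENSG00000000005_TNMD \
  ENSG00000000419_DPM1 ENSG00000000457_SCYL3 ENSG00000000460_C1orf112 > "$GENES"

# Hypothetical output location and naming scheme
OUTDIR=$(mktemp -d)

# Draw 3 random lists of 2 genes each; shuf -n samples without replacement
for i in 1 2 3 ; do
  shuf -n 2 "$GENES" > "$OUTDIR/list${i}.txt"
done

# Show how many genes ended up in each list
wc -l "$OUTDIR"/list*.txt
```

With the real 58302-gene file you would point GENES at that file and raise the subset size to something realistic.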

Minitalk: Understanding gene regulation in complex disease with deep sequencing

Today I gave a presentation on experiment design and the use of ChIP-seq and MBD-seq to understand gene regulation. The target audience consisted of biomedical scientists who had little background in genomics but were curious to incorporate deep sequencing into their studies.

Link to the slides HERE.


As always I love getting feedback - so leave your questions and comments below!

Shell aliases for bioinformatics

The shell has some nice features that make our bioinformatics lives a little easier for the things we do very frequently. In Ubuntu, the ~/.bashrc file is run each time a new terminal window is opened, so it is the place to customise the shell. Here are a few of my favourite general shortcuts. Let me know your favourites in the comments section below!

#shorten ls forms
alias ll='ls -alF'
alias la='ls -A'
alias l='ls -CF'


#shorten file viewing
alias h='head'
alias t='tail'
#the -S option to nano makes scrolling smoother
alias n='nano -S'
alias nano='nano -S'

#easy update
alias update='sudo apt-get update && sudo apt-get upgrade -y'

#search through history
alias hgrep='history | grep'

#Get col headers of tab delim file
ch(){
#read only the first line, then print one numbered header per line
head -n 1 "$1" | tr '\t' '\n' | nl -n ln
}
export -f ch

#login with ssh where IP is constant (X is the IP address)
alias login1='ssh -Y username@X.X.X.X' #scp can be done as above

#login with …
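To show what the column header shortcut is for, here is a small demo on a made-up tab-delimited file. The version of ch defined here reads only the first line of the file, which is my assumption about the intended behaviour, and the file contents are invented for the example:

```shell
# Variant of ch that looks only at the header (first) line of the file
ch() {
  head -n 1 "$1" | tr '\t' '\n' | nl -n ln
}

# Made-up example table with three columns
DEMO=$(mktemp)
printf 'gene\tsample1\tsample2\nTSPAN6\t10\t12\n' > "$DEMO"

# Prints each column name next to its column number
ch "$DEMO"
```

This is handy before a cut -f or an awk one-liner, when you need to know which column number holds the field you are after.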

Minitalk: on Excel Gene Name Errors

It was great to visit the Monash Clayton Bioinformatics team led by David Powell today to introduce myself and speak about a topic very close to my heart!

Slides below:

Also let me know what you think of the new theme of the blog in the comments below. BTW Just realised this is my 100th post! Yay for me! Thanks for reading!

How NGS is transforming medicine

Last month, I gave a talk at our departmental meeting, describing in general terms how high throughput sequencing technology was having real impacts in medicine and human health, as well as some emerging trends to watch out for in coming years.

Here's the link


Introducing the ENCODE Gene Set Hub

TL;DR We curated a bunch of ENCODE data into gene sets that are super useful in pathway analysis (e.g. GSEA).
Link to gene sets and data: https://sourceforge.net/projects/encodegenesethub/
Poster presentation: DOI:10.13140/RG.2.2.34302.59208

Now for the longer version. Gene sets are wonderful resources. We use them to do pathway level analyses and identify trends in data that lead us to improved interpretation and new hypotheses. Most pathway analysis tools like GSEA allow us to use custom gene sets. This is really cool, as you can start to generate gene sets based on your own profiling work and that of others.
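If you want to try building a custom gene set file yourself, here is a minimal sketch, assuming you start from a two-column tab-delimited table of set name and gene. It writes the GMT layout that GSEA reads: set name, a description field, then the genes, all tab separated. The to_gmt function name, the set names and the "na" description are placeholders of mine, not part of the ENCODE Gene Set Hub:

```shell
# to_gmt: hypothetical helper, converts "set<TAB>gene" rows into GMT lines.
# Sets are written in order of first appearance; description is set to "na".
to_gmt() {
  awk -F'\t' '
    !($1 in genes) { order[++n] = $1; genes[$1] = $2; next }
    { genes[$1] = genes[$1] "\t" $2 }
    END {
      for (i = 1; i <= n; i++)
        print order[i] "\tna\t" genes[order[i]]
    }
  ' "$1"
}

# Made-up example input: two sets with invented memberships
PAIRS=$(mktemp)
printf 'TF1_targets\tTSPAN6\nTF1_targets\tDPM1\nTF2_targets\tTNMD\n' > "$PAIRS"

to_gmt "$PAIRS"
```

The gene identifiers in the GMT file need to match the identifiers used in the expression data you feed into the pathway tool.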

There is huge value in curating experimental data into gene sets, as the MSigDB team have demonstrated. But overall, these data are under-shared. Even our group is guilty of not sharing the gene sets we've used in papers. There have been a few papers where we've used gene sets curated from ENCODE transcription factor binding site (TFBS) data to understand which TFs were drivi…