Liu Lab at Huazhong University of Science and Technology


Two novel RNA-binding proteins identification through computational prediction and experimental validation

This page describes the data-processing workflow used in the article "CLIP1 and DMD are two novel RNA-binding proteins through computational prediction and experimental validation". We hope it helps researchers who want to use the CLIP1 and DMD sequencing data. To apply the source code mentioned in this tutorial to other sequencing data, you only need to change CLIP1/DMD to your sample name.
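As a rough illustration of the renaming described above, the sketch below prints a copy of a script with one sample name replaced by another. The helper name rename_sample and the sample name MYRBP are hypothetical, and it is an assumption that a plain text substitution is enough; check where the phdRBP scripts actually use the sample name before relying on it.

```shell
# Hypothetical helper: print a script with one sample name replaced by another.
# $1: script file; $2: old sample name; $3: new sample name.
# Sample names are assumed to be plain alphanumeric strings (no '/' or '&'),
# so they can be placed directly inside the sed expression.
rename_sample() {
    sed "s/$2/$3/g" "$1"
}

# Example (MYRBP is a placeholder sample name):
# rename_sample run_for_Piranha.sh CLIP1 MYRBP > run_for_Piranha_MYRBP.sh
```

Writing the result to a new file keeps the original script untouched, so the CLIP1/DMD version remains available for comparison.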



phdRBP (Pipeline for High-throughput Data analysis for RNA-Binding Protein) is available:
  • phdRBP.tar.gz (17.4GB)

  • Raw data:

  • Raw data or peak files can be downloaded from: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE128318
  • Disease data are available:
  • Clinvar_cBioPortal.tar.gz (15.46M)

  • Uncompressing and using phdRBP:

    1. Download the phdRBP package.

    (1) Type "tar -zxvf phdRBP.tar.gz" to uncompress the package;

    (2) Type "cd phdRBP" to change the current directory;

    (3) For iRIP-seq, type "bash clean_data_iRIP-seq.sh" to pre-process the raw data;

      for CLIP-seq, type "bash clean_data_CLIP-seq.sh" to pre-process the raw data.

      After these steps, you will have the cleaned data.

    2. To use Piranha for peak calling, type:

      bash run_for_Piranha.sh

    3. To use CIMS for peak calling, type:

      bash run_for_CIMS.sh
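
    The steps above can be summarised in a small dry-run helper that prints the commands for one run without executing them. This is only a sketch: the function name phdrbp_plan is hypothetical, and whether every assay and peak-caller combination is meaningful for your data is not stated here.

```shell
# Print, in order, the phdRBP commands for one run without executing them.
# $1: assay type (iRIP-seq or CLIP-seq); $2: peak caller (Piranha or CIMS).
# Both arguments are validated before anything is printed.
phdrbp_plan() {
    case "$1" in
        iRIP-seq|CLIP-seq) ;;
        *) echo "unknown assay: $1" >&2; return 1 ;;
    esac
    case "$2" in
        Piranha|CIMS) ;;
        *) echo "unknown peak caller: $2" >&2; return 1 ;;
    esac
    echo "bash clean_data_$1.sh"
    echo "bash run_for_$2.sh"
}

# To execute rather than print, run from inside the phdRBP directory:
# phdrbp_plan CLIP-seq CIMS | sh -e
```

    Piping the output to "sh -e" stops at the first failing command, so peak calling is not attempted on data that failed cleaning.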



    Uncompressing and using Clinvar_cBioPortal:

    1. Download the Clinvar_cBioPortal package.

    (1) Type "tar -zxvf Clinvar_cBioPortal.tar.gz" to uncompress the package;

    (2) Type "cd Clinvar_cBioPortal" to change the current directory;

    (3) For ClinVar, type "cd ClinVar" to change the current directory;

      type "bash peak_clinvar.sh" to map the peak file to the ClinVar data and obtain the SNPs that fall within the peaks.

      After these steps, you will have the mutation information for the peaks.

    (4) For cBioPortal, type "cd cBioPortal" to change the current directory;

      type "bash run_iscancergene.sh" to analyze the cancer information of the target protein, such as CLIP1.

      After these steps, you will know in which cancers the target RBP and its interaction partners co-occur.
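
    Because both analyses start with a "cd" into a subdirectory of Clinvar_cBioPortal, it can be convenient to run each one in a subshell so that the working directory is unchanged afterwards. A minimal sketch, with the hypothetical helper name run_in_dir:

```shell
# Run a script inside a directory using a subshell, so the caller's
# working directory is the same before and after the call.
# $1: directory; $2: script name inside that directory.
run_in_dir() {
    ( cd "$1" && bash "$2" )
}

# From inside Clinvar_cBioPortal, using the directory and script names
# given in the steps above:
# run_in_dir ClinVar peak_clinvar.sh
# run_in_dir cBioPortal run_iscancergene.sh
```

    With this pattern the second call does not need to step back out of the ClinVar directory first.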





    Required programs:

  • Cutadapt: A tool for removing adapters (Version 1.12, 2016-11-28). For more information, please see https://cutadapt.readthedocs.io/en/stable/installation.html
  • or you can type "pip install cutadapt==1.12" to install Cutadapt


  • FASTX-Toolkit: It is a collection of command line tools for Short-Reads FASTA/FASTQ files preprocessing (Version 0.0.13,2010-02-02). It can be downloaded from http://hannonlab.cshl.edu/fastx_toolkit/fastx_toolkit-0.0.13.tar.bz2 For more information, please see http://hannonlab.cshl.edu/fastx_toolkit/download.html
  • or you can type:
      wget http://hannonlab.cshl.edu/fastx_toolkit/fastx_toolkit_0.0.13_binaries_Linux_2.6_amd64.tar.bz2
      tar -xjf fastx_toolkit_0.0.13_binaries_Linux_2.6_amd64.tar.bz2
      cp ./bin/* $HOME/bin
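
    Several install commands in this list copy binaries into $HOME/bin, which only helps if that directory exists and is on your PATH. The sketch below shows one way to append a directory to a PATH-style variable without adding it twice; the helper name append_path is hypothetical.

```shell
# Echo a PATH-style list with directory $2 appended to list $1,
# unless it is already present. An empty list yields just the directory,
# which avoids a stray leading ':' (an empty entry is treated as the
# current directory).
append_path() {
    case ":$1:" in
        *":$2:"*) echo "$1" ;;
        *) echo "${1:+$1:}$2" ;;
    esac
}

# Typical use in ~/.bash_profile:
# mkdir -p "$HOME/bin"
# export PATH="$(append_path "$PATH" "$HOME/bin")"
```

    The same helper works for other colon-separated variables such as LD_LIBRARY_PATH.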


  • BEDTools: A powerful toolset for genome arithmetic. (Version 2.20.1). It can be downloaded from https://github.com/arq5x/bedtools2/releases/download/v2.20.1/bedtools-2.20.1.tar.gz For more information, please see https://github.com/arq5x/bedtools2
  • or you can type:
      wget https://github.com/arq5x/bedtools2/releases/download/v2.20.1/bedtools-2.20.1.tar.gz
      tar -zxvf bedtools-2.20.1.tar.gz
      cd bedtools2-2.20.1
      make
      cp bin/* $HOME/bin


  • Bowtie2: It is an ultrafast and memory-efficient tool for aligning sequencing reads to long reference sequences. (Version 2.2.5). It can be downloaded from https://sourceforge.net/projects/bowtie-bio/files/bowtie2/2.2.5/bowtie2-2.2.5-source.zip/download
  • after downloading, you can type:
      unzip bowtie2-2.2.5-source.zip
      cd bowtie2-2.2.5
      make
      cp bowtie2* $HOME/bin

  • TopHat2: It is a fast splice junction mapper for RNA-Seq reads (Version 2.1.1,2016-02-23). It can be downloaded from https://ccb.jhu.edu/software/tophat/downloads/tophat-2.1.1.Linux_x86_64.tar.gz For more information, please see https://ccb.jhu.edu/software/tophat/index.shtml
  • or you can type:
      wget https://ccb.jhu.edu/software/tophat/downloads/tophat-2.1.1.Linux_x86_64.tar.gz
      tar zxvf tophat-2.1.1.Linux_x86_64.tar.gz
      cd tophat-2.1.1.Linux_x86_64/
      cp -r * $HOME/bin


  • HTSlib: A C library for reading/writing high-throughput sequencing data (Version 1.9). It can be downloaded from https://github.com/samtools/htslib/releases/download/1.9/htslib-1.9.tar.bz2 For more information, please see http://www.htslib.org/
  • or you can type:
      wget https://github.com/samtools/htslib/releases/download/1.9/htslib-1.9.tar.bz2
      tar -jxvf htslib-1.9.tar.bz2
      cd htslib-1.9
      ./configure --prefix=$HOME
      make && make install

  • SAMtools: It is a suite of programs for interacting with high-throughput sequencing data. (Version 1.8,2018-04-03). It can be downloaded from https://sourceforge.net/projects/samtools/files/samtools/1.8/samtools-1.8.tar.bz2/download For more information, please see http://samtools.sourceforge.net/
  • after downloading, you can type:
      tar -jxvf samtools-1.8.tar.bz2
      cd samtools-1.8
      ./configure --prefix=$HOME --with-htslib="the path of htslib-1.9"
      make && make install


  • HOMER: It is a suite of tools for Motif Discovery and next-gen sequencing analysis (Version 4.8.2). It can be downloaded from http://homer.ucsd.edu/homer/data/software/homer.v4.8.2.zip. For more information, please see http://homer.ucsd.edu/homer/. Please refer to the "README.txt" in the homer folder for the installation method.

  • gencode.v23.annotation.gff3 and GRCh38.p3.genome.fa: Reference genomic data (Version 23). They can be downloaded from ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_23/
  • novoalign: It is used for read mapping (Version 3.07.00). It can be downloaded from http://www.novocraft.com/support/download/
  • after downloading, you can type:
      Add the novocraft directory to your executable path
      i.e. edit your ~/.bash_profile file to include:
      PATH=$PATH:/home/novocraft

    Build the novoalign index by typing:
    ./novoindex -k 14 -s 1 GRCH38_gencode_v23.ndx GRCh38.p3.genome.fa
    or download the prebuilt novoalign index from http://www.rnabinding.com/phdRBP/data/GRCH38_gencode_v23.ndx
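
    Building the index is only needed when no index file is present yet. The sketch below prints the build command when the index is missing and prints nothing otherwise. The helper name novoindex_cmd is hypothetical, and it assumes the novocraft directory is already on your PATH and that the file names contain no spaces.

```shell
# Print the novoindex command needed to build index $1 from FASTA $2,
# or print nothing if a non-empty index file already exists.
# The options -k 14 -s 1 are the ones given in this tutorial.
novoindex_cmd() {
    if [ -s "$1" ]; then
        return 0
    fi
    echo "novoindex -k 14 -s 1 $1 $2"
}

# Example: build only when needed.
# cmd=$(novoindex_cmd GRCH38_gencode_v23.ndx GRCh38.p3.genome.fa)
# [ -z "$cmd" ] || sh -c "$cmd"
```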

  • CTK: It provides a set of tools for analysis of CLIP data starting from the raw reads generated by the sequencer (Version 1.0.3,2016-08-08). It can be downloaded from https://zhanglab.c2b2.columbia.edu/index.php/CTK_Documentation
  • or you can type:
     1. wget https://cpan.metacpan.org/authors/id/C/CA/CALLAHAN/Math-CDF-0.1.tar.gz
      tar -zxvf Math-CDF-0.1.tar.gz
      cd Math-CDF-0.1
      perl Makefile.PL
      make && make install

     2. wget https://github.com/chaolinzhanglab/ctk/archive/v1.0.7.tar.gz
      tar -zxvf v1.0.7.tar.gz
      Add the ctk directory to your executable path:
      i.e. edit your ~/.bash_profile file to include:
      export PATH=$PATH:$HOME/ctk-1.0.7

  • Piranha: It is a tool developed for peak calling (Version 1.2.1). It can be downloaded from http://smithlabresearch.org/downloads/piranha-1.2.1.tar.gz
  • or you can type:
     1. wget http://mirrors.kernel.org/gnu/gsl/gsl-2.2.tar.gz
      tar -zxvf gsl-2.2.tar.gz
      cd gsl-2.2
      ./configure --prefix=$HOME/gsl
      make && make install

      Add the gsl directories to your compiler and linker search paths:
      i.e. edit your ~/.bash_profile file to include:
      export C_INCLUDE_PATH=$C_INCLUDE_PATH:$HOME/gsl/include
      export CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:$HOME/gsl/include
      export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/gsl/lib
      export LIBRARY_PATH=$LIBRARY_PATH:$HOME/gsl/lib
     2. wget http://smithlabresearch.org/downloads/piranha-1.2.1.tar.gz
      tar -zxvf piranha-1.2.1.tar.gz
      cd piranha-1.2.1
      ./configure --prefix=$HOME
      make && make install

      Add the Piranha directory to your executable path:
      i.e. edit your ~/.bash_profile file to include:
      export PATH=$PATH:$HOME/piranha/piranha-1.2.1/bin
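
    Once everything is installed, a quick check that the programs can be found on your PATH may save a failed run later. The sketch below uses the hypothetical helper name check_tools; the executable names in the usage example are assumptions about what each package installs and may differ on your system.

```shell
# Print each named program that is not found on PATH and
# return non-zero if any are missing.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool"
            missing=1
        fi
    done
    return "$missing"
}

# Executable names below are assumptions; adjust them to your installation.
# check_tools cutadapt fastx_clipper bedtools bowtie2 tophat2 samtools \
#     novoalign Piranha CIMS.pl findMotifsGenome.pl
```

    Note that the helper sets the shell variables missing and tool as a side effect.
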
    Contact us:

    For any questions about phdRBP, please email liushiyong@gmail.com.

    Reference:

    Juan Xie, Xiaoli Zhang, Jinfang Zheng, Xu Hong, Xiaoxue Tong, Xudong Liu, Yaqiang Xue, Xuelian Wang, Yi Zhang and Shiyong Liu
    Two novel RNA-binding proteins identification through computational prediction and experimental validation.
    Genomics, S0888-7543(21)00429-8, 15 December 2021


    Last modified: Fri. Oct. 30 10:32:00 CST 2020