Method Article

Projection layers improve deep learning models of regulatory DNA function

[version 1; peer review: 1 approved, 1 approved with reservations]
PUBLISHED 05 Feb 2019


Abstract

With the increasing application of deep learning methods to the modelling of regulatory DNA sequences has come an interest in exploring what types of architecture are best suited to the domain. Networks designed to predict many functional characteristics of noncoding DNA in a multitask framework have to recognise a large number of motifs and as a result benefit from large numbers of convolutional filters in the first layer. The use of large first layers in turn motivates an exploration of strategies for addressing the sparsity of output and the possibility of overfitting that result. To this end we propose the use of a dimensionality-reducing linear projection layer after the initial motif-recognising convolutions. In experiments with a reduced version of the DeepSEA dataset we find that inserting this layer in combination with dropout into convolutional and convolutional-recurrent architectures can improve predictive performance across a range of first layer sizes. We further validate our approach by incorporating the projection layer into a new convolutional-recurrent architecture which achieves state-of-the-art performance on the full DeepSEA dataset. Analysis of the learned projection weights shows that the inclusion of this layer simplifies the network’s internal representation of the occurrence of motifs, notably by projecting features representing forward and reverse-complement motifs to similar positions in the lower-dimensional feature space output by the layer.

Keywords

sequence analysis, deep learning, gene regulation

Introduction

The abundance of data characterising the function of noncoding DNA at high resolution facilitates the use of complex data-driven methods to learn the sequence features known as ‘motifs’ that encode this function1. A number of works have used neural networks to model human regulatory DNA, taking as input fixed-length regions of DNA sequence and predicting properties such as transcription factor binding, chromatin accessibility and histone marks using data collected by ENCODE and other consortia2–7. Several of these networks are intended to simultaneously model a wide variety of the functional characteristics of the input region, by predicting hundreds or even thousands of such measurements across multiple cell types in a multi-task learning framework. With hundreds of known regulatory motifs recorded in databases such as JASPAR8, machine learning models capable of fully characterising a significant variety of the measurable functional properties of human noncoding DNA must be able to recognise a large number of distinct patterns in the input sequence. Indeed, while existing approaches have varied in the details of their neural network architectures, they have tended to share the use of relatively large numbers of convolutions as motif scanners in the first layer, and differed mainly in the subsequent layers, where standard convolutions, dilated convolutions and recurrent layers have all been used to model interactions between features2–6. The best reported performance on the DeepSEA benchmark was achieved by a network having 1024 convolutional kernels in its first layer4; indeed, even in experiments with single-output networks designed to predict binding for a single transcription factor9, the performance of convolutional networks was observed to continue increasing with the number of first-layer filters up to over 100 filters.

The use of a sufficient number of first layer filters to capture the variety of motifs relevant to the task at hand thus appears to be an important consideration in the design of neural networks for processing noncoding DNA sequences. At the same time, it raises questions. For one thing, the use of a large number of parameters in the first layer raises the possibility of overfitting. Moreover, first layers designed to recognise large numbers of specific motifs are bound to produce outputs which are relatively sparse and high-dimensional, which may hamper learning in subsequent layers10. Finally, these layers are computationally expensive, particularly when applied to long sequences, both due to the cost of computing the activations by convolving the input at each point in the sequence, and the cost in the next layer of processing sequences of high-dimensional activation vectors.

Standard regularisation techniques such as dropout11 may be expected to help alleviate the problem of overfitting, and have been applied to the first convolutional layer in previous works. But there is room for further work both in terms of characterising the extent of the problem and investigating alternative solutions. Projection layers, which can be used to reduce the dimensionality of a representation without reducing its resolution, are a popular component of deep networks in computer vision where they are often referred to as 1 × 1 convolutions12,14,15. Reducing the dimensionality of a layer’s activations reduces the number of parameters required in the subsequent layer, as well as the cost of computing that layer’s activations. At the same time, depending on the nature of the features learned in the first layer, the denser representation resulting from the projection may well preserve much of the information contained therein. Even random projections are well known to preserve distances in dense representations16,17.

The common practice of including amongst the training inputs both forward and reverse-complement versions of each target sequence particularly motivates the exploration of a more compressed representation. This form of data augmentation forces models to recognise distinct instances (forward and reverse-complement) of functionally equivalent motifs. Methods capable of identifying these two instantiations with one another therefore offer the same capacity at potentially lower representational cost. Recognition of this issue has motivated the development of layers specially adapted to ensure the identity of forward and reverse-complement sequences18. The use of projections offers an alternative approach to this problem. Here we focus on the design choices related to the capacity of multi-task networks to recognise a sufficient variety of motifs in input sequences, by jointly exploring both the effect of the number of first layer filters and the use of projection and dropout as approaches designed to mitigate the disadvantages of a large first layer. We choose to address these questions using the DeepSEA dataset2, since it has previously been used to benchmark different network architectures4,6. Initially using a reduced version of the dataset with shortened input regions, we vary the number of first layer filters for standard convolutional and convolutional-recurrent architectures with and without a projection layer and dropout, with our results indicating the importance of regularisation and the performance benefits of projection. We then incorporate the projection layer into a convolutional-recurrent neural network architecture that modifies the DanQ architecture4 in several respects. This new architecture achieves state-of-the-art performance on the full DeepSEA dataset.

Methods

Baseline architectures

We experiment with modifications to two classes of architecture which have been successfully applied for multitask prediction in regulatory genomics. Details of the hyperparameters we used when training versions of these models are provided in the sections describing the relevant experiments.

  • 1. Convolutional neural network (CNN): Both DeepSEA2 and Basset3 use 3-layer CNNs, consisting of a stack of three convolution and max-pooling operations followed by one or more fully connected layers. DeepSEA’s convolutional layers are regularized using dropout and a global L2 penalty, whereas Basset applies batch normalization after each convolutional layer.

  • 2. DanQ: The DanQ convolutional-recurrent architecture consists of a single convolutional layer followed by a pooling layer and a bidirectional long short-term memory (LSTM)19. The full sequence of LSTM outputs is passed through two fully connected layers in order to generate predictions. Two versions of this architecture have been reported4, DanQ and DanQ-JASPAR, differing in the sizes of the layers and in the initialization used for the first layer, with half of the better-performing DanQ-JASPAR’s 1024 first-layer filters being initialized using known motifs from the JASPAR database. Like DeepSEA, both DanQ architectures use dropout after their single convolutional layer. A minimal sketch of a DanQ-style network is given after this list.
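As a point of reference for the architectures discussed below, the following is a minimal Keras sketch of a DanQ-style convolutional-recurrent baseline. The filter width, pooling size and hidden-layer sizes shown here are illustrative rather than taken from this article, and the variable names are our own.

```python
# Hypothetical DanQ-style baseline: convolution -> pooling -> dropout ->
# bidirectional LSTM -> two fully connected layers. Sizes are illustrative.
from keras.layers import (Input, Conv1D, MaxPooling1D, Dropout,
                          Bidirectional, LSTM, Flatten, Dense)
from keras.models import Model

inputs = Input(shape=(1000, 4))                       # one-hot encoded DNA sequence
x = Conv1D(320, 26, activation='relu')(inputs)        # motif-scanning first layer
x = MaxPooling1D(pool_size=13, strides=13)(x)
x = Dropout(0.2)(x)                                   # dropout after the single conv layer
x = Bidirectional(LSTM(320, return_sequences=True))(x)
x = Flatten()(x)                                      # full sequence of LSTM outputs
x = Dense(925, activation='relu')(x)                  # first fully connected layer
outputs = Dense(919, activation='sigmoid')(x)         # one output per chromatin feature
model = Model(inputs, outputs)
```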

Linear projection layer

We investigate the use of a linear projection applied to the pooled activations of the first layer of architectures of both types. In detail, suppose that the first layer has m 1D convolutional filters and that after pooling the length of the sequence representation is l. Then the pooled activations form a sequence (a1, a2, …, al) of m-dimensional vectors. The output of the projection layer is a sequence (v1, v2, …, vl) of k-dimensional vectors (k < m):

vi = P ai        (1)

where P is a weight matrix of size k × m. The projection layer’s output is a sequence of the same length as the sequence of the first layer’s pooled filter activations, but whose members are vectors of a lower dimension, with the same projection matrix P being used to reduce the dimension at each point in the sequence. All the results reported below were obtained using a value of k = 64, which seemed to represent a good trade-off between dimensionality reduction and preservation of information.
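Because the same matrix P is applied at every position, the projection can be implemented as a length-1 convolution. The following is a minimal Keras sketch (matching the Keras/Theano stack used in our experiments); the surrounding layer sizes are illustrative and the variable names are our own.

```python
# The projection of equation (1) as a length-1 ("1 x 1") convolution with no
# bias and no nonlinearity: the same k x m matrix P is applied at every
# position of the pooled activation sequence.
from keras.layers import Input, Conv1D, MaxPooling1D
from keras.models import Model

seq_len, m, k = 500, 1024, 64                      # input length, first-layer filters, projection dim

inputs = Input(shape=(seq_len, 4))                 # one-hot encoded DNA
x = Conv1D(m, 19, activation='relu')(inputs)       # motif-scanning first layer (illustrative sizes)
x = MaxPooling1D(pool_size=6)(x)                   # pooled activations: (batch, l, m)
projected = Conv1D(k, 1, use_bias=False)(x)        # projection layer output: (batch, l, k)

model = Model(inputs, projected)
```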

Improved convolutional-recurrent architecture

The best previously reported performance on the DeepSEA dataset was achieved by the DanQ-JASPAR architecture, which uses a single large convolutional layer followed by a max-pooling layer with stride and pool size of 15. This layer summarises the presence of the motifs identified by the convolutional layer across relatively large 15 bp stretches of input. Pooling so aggressively has the advantage of controlling the length of sequence to be fed into the LSTM, preventing computation in the recurrent layer from becoming prohibitively time consuming.

We hypothesise that this pooling discards useful positional information, which could be better preserved by splitting the downsampling across two sets of convolution and pooling layers rather than a single one. We therefore propose an alternative convolutional-recurrent (CRNN) architecture, which adds a projection layer, a second convolutional layer and a second pooling operation between the pooled outputs of the first convolutional layer and the bidirectional LSTM. To ensure fair comparison, the overall level of downsampling in the convolution and pooling layers is the same as in the DanQ-JASPAR networks, such that the length of the sequence of inputs to the bidirectional LSTM is the same (64) in both cases. In common with the DanQ networks we use a single fully-connected hidden layer before the output layer, but in order to control overfitting we use as input to this layer not the full sequence of LSTM outputs but their global mean. The proposed network, full details of which are given below, has far fewer parameters than DanQ-JASPAR and trains faster.

Experiments

The DeepSEA dataset. The DeepSEA dataset consists of sequences of 1000 bp from the human noncoding genome, labelled for the presence of a peak in the central 200 bp in the signal for each of 919 chromatin features taken from ENCODE and Roadmap20,21. These features represent a range of transcription factor binding, chromatin accessibility and histone modification measurements across a variety of cell types. Both forward and reverse-complement versions of the sequence corresponding to each set of targets are included in the dataset, meaning that models must be capable of learning both forward and reverse-complement motifs. We use the original training, validation and test splits and, following previous work4, use as our primary evaluation metric the test set area under the precision-recall curve (AUPRC), which is calculated after averaging predictions across forward and reverse-complement versions of each sequence.
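For concreteness, a per-task version of this metric could be computed roughly as follows; this is a sketch using scikit-learn rather than the exact evaluation code, and the array names are placeholders.

```python
# Sketch of the evaluation metric: per-task AUPRC computed after averaging
# predictions on the forward and reverse-complement versions of each sequence.
import numpy as np
from sklearn.metrics import average_precision_score

def mean_test_auprc(y_true, preds_forward, preds_revcomp):
    """y_true: (n_seqs, 919) binary labels; preds_*: matching prediction arrays."""
    preds = (preds_forward + preds_revcomp) / 2.0
    auprcs = []
    for task in range(y_true.shape[1]):
        if y_true[:, task].any():                      # skip tasks with no positive labels
            auprcs.append(average_precision_score(y_true[:, task], preds[:, task]))
    return float(np.mean(auprcs))
```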

Design choices related to first layer on reduced DeepSEA dataset. In our first set of experiments we seek to rigorously explore the optimal configuration of the early layers of instantiations of both CNN and DanQ network designs. We vary the number of first layer filters, the use of dropout immediately after the first pooling layer, and the use of a projection layer (we fix the output dimension of this layer at each point in the sequence to 64) while keeping other hyperparameters fixed for a version of each class of architecture. When dropout and projection are used together, the dropout is applied after the projection layer. A dropout rate of 0.2 is used in all cases, which is the same as that applied to the activations of the first convolutional layer in both the DeepSEA and DanQ architectures. Other modifications to the original architectures were made in the interests of retaining comparable performance while reducing computational cost and are described below.
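Schematically, the varied first-layer configuration can be thought of as a single block with optional projection and dropout; the sketch below (with an illustrative kernel size and our own function name) shows the ordering used when both are present, with dropout applied after the projection.

```python
# Hypothetical helper building the first-layer block used in these experiments:
# convolution -> pooling, optionally followed by a 64-dimensional projection,
# optionally followed by dropout (applied after the projection when both are used).
from keras.layers import Conv1D, MaxPooling1D, Dropout, LeakyReLU

def first_layer_block(x, n_filters, use_projection=False, use_dropout=False):
    x = Conv1D(n_filters, 19)(x)               # motif-scanning convolution (illustrative kernel size)
    x = LeakyReLU()(x)
    x = MaxPooling1D(pool_size=6)(x)
    if use_projection:
        x = Conv1D(64, 1, use_bias=False)(x)   # linear projection to 64 dimensions
    if use_dropout:
        x = Dropout(0.2)(x)
    return x
```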

The CNN model that we choose to explore here takes from Basset the use of 3 convolutional layers, with kernel sizes of 19, 11 and 7 respectively, but varies in several other details. We use max pooling operations of sizes 6, 2, and 2 after the convolutional layers. The number of filters in the second and third convolutional layers is held fixed at 128 and 256 respectively. The outputs of the final pooling operation are fed into a single hidden layer of 2048 neurons to which dropout with a rate of 0.5 is applied. Leaky ReLUs22 are used for all activations. Our DanQ architectures follow the original in most details other than those under investigation, except for the use of Leaky ReLU rather than ReLU activations, and the use of a reduced number of LSTM cells (100) in each direction. To mitigate the cost of these experiments, we run them on a reduced version of the DeepSEA dataset, using only the central 500 bp of each 1000 bp sequence. For all networks we use the Adam optimizer23 with an initial learning rate of 3 × 10−4 to minimize the multitask binary cross entropy loss via mini-batch gradient descent with a batch size of 256. The learning rate was reduced by a factor of 5 if the validation loss did not decrease for two epochs. Training was terminated if the validation loss did not improve for five epochs. All models were implemented in Keras24 using the Theano backend25.
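In Keras this training setup might look roughly as follows; the callback classes are standard, but the exact scheduling code used in our experiments may differ, and `model`, `x_train`, etc. are placeholders.

```python
# Sketch of the training configuration: Adam at 3e-4, binary cross-entropy,
# batch size 256, learning rate divided by 5 after two epochs without
# validation improvement, training stopped after five such epochs.
from keras.optimizers import Adam
from keras.callbacks import ReduceLROnPlateau, EarlyStopping

model.compile(optimizer=Adam(lr=3e-4), loss='binary_crossentropy')

callbacks = [
    ReduceLROnPlateau(monitor='val_loss', factor=0.2, patience=2),  # reduce lr by a factor of 5
    EarlyStopping(monitor='val_loss', patience=5),
]

model.fit(x_train, y_train,
          batch_size=256,
          epochs=100,                        # upper bound; early stopping ends training
          validation_data=(x_val, y_val),
          callbacks=callbacks)
```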

Evaluation of CRNN architecture on full DeepSEA dataset. For the second set of experiments we use the full 1000 bp for each sequence and seek to compare the performance of our improved CRNN architecture to that of DeepSEA and the two DanQ architectures. For comparison with the two variants of DanQ, DanQ and DanQ-JASPAR, which have, respectively, 320 and 1024 filters in the first layer, we explore two variants of our CRNN architecture with 320 and 700 filters of length 30 in the first layer. To evaluate the contribution of the projection layer, for each CRNN variant we train one network with projection after the first pooling operation, and one network without projection but otherwise identical to the first. All networks use a second convolutional layer with 128 filters of length 11, whose activations are pooled and fed into a bidirectional LSTM with 300 units in each direction. Max-pooling with stride and pool size of 7 after the first convolutional layer and 2 after the second convolutional layer, together with unpadded convolutions, ensures that the sequence of inputs to the LSTM is of the same length as in the DanQ-JASPAR model. Dropout with a rate of 0.15 is applied to the projected first layer activations if projection is used, and to the pooled first layer activations if not. Recurrent dropout26 with a rate of 0.2 is applied to the LSTM. Leaky ReLUs are used for all layer activations. Networks are trained using the same learning rate schedules as in the previous set of experiments. We compare the average test set AUPRCs of our models with those of the publicly available trained DeepSEA and DanQ networks.
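Putting these pieces together, the CRNN-700-projection variant might be assembled in Keras roughly as follows. The sketch follows the hyperparameters stated above and in the note to Table 1; details the text does not pin down (for example exactly where the Leaky ReLU activations are inserted) are our assumptions.

```python
# Sketch of the CRNN-700-projection architecture: two convolution/pooling
# stages with a 64-dimensional projection in between, a bidirectional LSTM
# whose outputs are averaged, a 919-unit hidden layer and a sigmoid output
# per chromatin feature.
from keras.layers import (Input, Conv1D, MaxPooling1D, Dropout, Bidirectional,
                          LSTM, GlobalAveragePooling1D, Dense, LeakyReLU)
from keras.models import Model

inputs = Input(shape=(1000, 4))                              # one-hot DNA, full 1000 bp

x = Conv1D(700, 30)(inputs)                                  # unpadded motif-scanning layer: 1000 -> 971
x = LeakyReLU()(x)
x = MaxPooling1D(pool_size=7, strides=7)(x)                  # 971 -> 138 positions

x = Conv1D(64, 1, use_bias=False)(x)                         # linear projection to 64 dimensions
x = Dropout(0.15)(x)                                         # dropout on the projected activations

x = Conv1D(128, 11)(x)                                       # second convolution: 138 -> 128
x = LeakyReLU()(x)
x = MaxPooling1D(pool_size=2, strides=2)(x)                  # 128 -> 64 positions

x = Bidirectional(LSTM(300, return_sequences=True, recurrent_dropout=0.2))(x)
x = GlobalAveragePooling1D()(x)                              # global mean of the LSTM outputs

x = Dense(919)(x)                                            # hidden layer (919 units, per Table 1)
x = LeakyReLU()(x)
outputs = Dense(919, activation='sigmoid')(x)                # one prediction per chromatin feature

model = Model(inputs, outputs)
```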

The source code for all models and experiments is available on GitHub and Zenodo27.

Results

Effects of first layer design choices on reduced-size dataset

In both fully convolutional and convolutional-recurrent architectures, consistent benefits were achieved by increasing the number of first layer filters, with gradual saturation of performance (as measured by test set AUPRC averaged across the tasks) at around 1000 filters in both cases (Figure 1). In the fully convolutional networks the benefit of the projection layer was very clear, with all networks which used projection outperforming those that did not, often by considerable margins. A combination of dropout and projection achieved the best performance in every case. There is less evidence of benefit in the case of the networks using DanQ-style architectures, with networks with regularisation sometimes outperforming those without, but no clear pattern in the results, at least under the test set AUPRC metric. This is despite models incorporating dropout and the projection layer consistently achieving lower cross-entropy loss on the validation set. One factor in the difference between the two types of architectures is the degree of overfitting that the standard, unregularised architecture suffers. We observed that fully convolutional architectures showed a much greater tendency to overfit than convolutional-recurrent architectures (Figure 2). We note that unlike a convolutional layer, an LSTM already learns its own projection in the form of the weight matrix which transforms the inputs into the internal state space within the input and forget gates. These internal projections may help reduce both the tendency to overfit and the potential performance improvement associated with incorporating an additional projection layer. In contrast, inserting a projection layer into a CNN architecture substantially reduces the degree of overfitting (Figure 2), which allows CNN networks including projection layers to continue to benefit from adding additional filters in the first layer, whereas without projection, CNN performance hardly improves beyond 500 first layer filters, as the benefit of extra feature detectors is offset by the increased likelihood of overfitting.


Figure 1. Test set AUPRC as function of first layer size for CNNs (left) and DanQ networks (right) with and without projection layer (projecting down to 64 dimensions) and dropout (dropout rate of 0.2).

Jitter was added to the number of first layer filters for DanQ architectures to enable the points to be distinguished.


Figure 2. Validation set loss curves for CNN (left) and DanQ (right) models with 500 first layer filters, either with or without the two regularisation strategies.

The CNN network shows much more evidence of overfitting.

Projection layer helps improved CRNN architecture outperform other models on full DeepSEA data

Table 1 shows the validation and test set cross-entropy losses and test set AUPRCs for our best-performing convolutional-recurrent (CRNN) models as well as published baselines. CRNN-700-projection achieves the best average test set AUPRC of the compared models while being significantly less costly to train than DanQ-JASPAR, and without requiring the use of any known motifs to initialize first layer filters, as DanQ-JASPAR does. For both CRNN models we also compare the performance of models with and without the projection layer. In both cases, the projection layer leads to a clear increase in performance and a reduction in the cost per epoch of training the network.

Table 1. Performance of CRNN models with and without projection layer compared to DeepSEA and DanQ networks.

CRNN-n is a model with 2 convolutional layers with n and 128 filters respectively, with kernel sizes of 30 and 11, followed by a bidirectional LSTM with 300 units in each direction, whose outputs are averaged and fed through a hidden layer with 919 units which in turn feeds into the output layer. CRNN-n-projection is identical to CRNN-n except for the inclusion of a projection layer between the first and second convolutional layers, which effectively reduces the dimension of the first layer’s activations from n to 64. Losses and AUPRCs for the DanQ and DeepSEA networks are calculated using the publicly available model weight files. AUPRCs for all models are calculated after averaging predictions for forward and reverse-complement versions of each test sequence, whereas forward and reverse-complement versions of each sequence contribute independently to the reported losses.

Model                  Parameters     Valid Loss   Test Loss   Test AUPRC
DeepSEA                155,159,839    0.0509       0.0554      0.343
DanQ                   46,926,479     0.0491       0.0538      0.371
DanQ-JASPAR            67,892,175     0.0482       0.0533      0.379
CRNN-320-projection    2,477,479      0.0485       0.0532      0.383
CRNN-320               2,727,335      0.0489       0.0540      0.375
CRNN-700-projection    2,547,779      0.0475       0.0526      0.391
CRNN-700               3,174,595      0.0484       0.0533      0.385

Projection layer simplifies learning by unifying representations for forward and reverse-complement motifs

To understand the nature of the performance benefits brought by the use of the projection layer, we can investigate the relationship between the learned projection weights and the motifs learned by the first convolutional layer. To associate a motif with each filter in the first layer we follow a procedure similar to one introduced previously1: several thousand sequences from the training set are passed through the trained model, and for each first layer convolutional filter we record the identities of the nucleotides at each position in the maximally-activating stretch of input in each sequence in which that filter is activated. From this we construct a position frequency matrix (PFM) which can be converted into a motif representing the typical input pattern recognised by the filter. Using TOMTOM28 to search the JASPAR 2018 database8 we find that 257 of the 700 learned motifs of the best-performing CRNN-700-projection model have at least one significant match (q < 0.01). Each learned motif is also associated with one of the columns in the 64 × 700 weight matrix of the projection layer. Suppose for example that at a certain point in an input sequence, the motif recognised by the ith convolutional filter occurs. Assuming none of the other filters are activated by this motif or its neighbouring region, the network’s representation of this region of the input will then just be the vector obtained by multiplying each weight in the ith column of the projection matrix by the filter’s activation. Thus the ith column of the projection matrix can be interpreted as an embedding of the motif learned by the ith convolutional filter. To visualise these embeddings, we focus on a subset of the learned motifs which have the best matches to known motifs, selecting only the 44 learned motifs with q-values less than 10−8. The result of performing a PCA on the 44 columns of the projection weight matrix associated with these motifs is shown in Figure 3. Most strikingly, different versions of the same motif tend to cluster together, with the embeddings for filters which learn to recognise the forward version of a particular motif very often close to those for filters which recognise the reverse complement of the same motif. This suggests that the projection layer allows for a more efficient internal representation of motifs, recognising that forward and reverse-complement patterns are functionally equivalent even though they are entirely distinct at the sequence level and therefore require separate feature detectors in the first layer. This representation of functional equivalence allows networks with a projection layer to harness the benefits of reverse-complement data augmentation without paying a price in terms of representational complexity.
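The analysis behind Figure 3 amounts to running PCA on the selected columns of the projection matrix; a rough sketch is given below, where `projection_weights` and `selected_filters` stand in for values extracted from the trained model and the TOMTOM results.

```python
# Sketch of the Figure 3 analysis: treat each column of the 64 x 700 projection
# matrix as an embedding of the corresponding first-layer motif and run PCA on
# the columns whose motifs match JASPAR with q < 1e-8.
import numpy as np
from sklearn.decomposition import PCA

projection_weights = np.random.randn(64, 700)             # stand-in for the learned matrix P
selected_filters = np.arange(44)                          # indices of filters with q < 1e-8

embeddings = projection_weights[:, selected_filters].T    # (44, 64): one row per selected motif
coords = PCA(n_components=2).fit_transform(embeddings)    # 2D coordinates for plotting
```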


Figure 3. PCA of projection weights corresponding to learned motifs with best matches to known motifs in JASPAR database.

Each point represents one of the 64-dimensional column vectors of the projection weight matrix. Only columns corresponding to learned motifs with a match with q-value less than 10−8 are included in the PCA to aid visualisation. Points are labelled by the name of the matched motif and by whether it is the forward (f) or the reverse complement (rc) version of the known motif that is matched. Points are coloured by transcription factor family (cyan: C2H2 zinc finger, green: basic leucine zipper, red: homeodomain, purple: basic helix-loop-helix, blue: all other).

Discussion

Despite the recent progress in the application of deep learning methods to model genomic data, there remains work to be done in understanding the types of architecture and design choices best suited to the domain. We provide further evidence here that the performance of networks whose goal is to predict hundreds of functional properties from the DNA sequence is strongly dependent on the number of convolutional filters in the first layer. In networks where the subsequent layer is also convolutional, performance can be further improved by inserting a dimensionality-reducing projection layer between the two sets of convolutions. A similar use of projection layers in networks designed to predict enhancers was independently proposed29 while we were finalising this manuscript. Their network takes as input both DNA sequences and chromatin accessibility information, and intersperses projections and convolutions on each of the two data modalities. While their work shows that projections can be used in high-performing architectures for regulatory genomics problems, they did not explore the role of projections in achieving this performance. Here our aim is to draw particular attention to the performance benefits and mode of functioning of a single projection layer, inserted directly after a first DNA motif-recognising convolutional layer, since we believe these point to its potential utility beyond any single application. In particular, we show that the projection layer is capable of learning the identity between forward and reverse-complement versions of functionally equivalent motifs and thereby simplifying the representation of the functional content of the sequence. It also reduces the number of parameters required in the subsequent layer, leading to less overfitting (particularly in combination with dropout) and reducing the computational cost. Incorporating the projection layer into a convolutional-recurrent network architecture similar to the DanQ architecture leads to improved performance on the DeepSEA dataset with fewer parameters and shorter per-epoch training times. Although we have only tested the use of the projection layer on the DeepSEA dataset, we believe that it could be of significant benefit in other situations in which accurate prediction of the targets requires recognition of a large variety of motifs in the input sequence.

Data availability

DeepSEA dataset: the full dataset, including train, test and validations splits, may be downloaded from http://deepsea.princeton.edu/help/.

Software availability

Source code for models and experiments is available at: https://github.com/alex-hh/motif_projection.

Archived source code at time of publication: https://doi.org/10.5281/zenodo.2543796.

License: MIT license.
