Assessing the potential relevance of CEACAM6 as a blood transcriptional biomarker

Background Changes in blood transcript abundance levels have been associated with pathogenesis in a wide range of diseases. While next generation sequencing technology can measure transcript abundance on a genome-wide scale, downstream clinical applications often require small sets of genes to be selected for inclusion in targeted panels. Here we set out to gather information from the literature and transcriptome datasets that would help researchers determine whether to include the gene CEACAM6 in such panels.
Methods We employed a workflow to systematically retrieve, structure, and aggregate information derived from both the literature and public transcriptome datasets. It consisted of profiling the CEACAM6 literature to identify major diseases associated with this candidate gene and establish its relevance as a biomarker. Accessing blood transcriptome datasets identified additional instances where CEACAM6 transcript levels differ in cases vs controls. Finally, the information retrieved throughout this process was captured in a structured format and aggregated in interactive circle packing plots.
Results Although it is not routinely used clinically, the relevance of CEACAM6 as a biomarker has already been well established in the cancer field, where it has invariably been found to be associated with poor prognosis. Focusing on the blood transcriptome literature, we found studies reporting elevated levels of CEACAM6 abundance across a wide range of pathologies, especially diseases where inflammation plays a dominant role, such as asthma, psoriasis, or Parkinson’s disease. The screening of public blood transcriptome datasets completed this picture, showing higher abundance levels in patients with infectious diseases caused by viral and bacterial pathogens.
Conclusions Targeted assays measuring CEACAM6 transcript abundance in blood may be of potential utility for the management of patients with diseases presenting with systemic inflammation and for the management of patients with cancer, where the assay could potentially be run both on blood and tumor tissues.


Introduction
Changes in blood transcript abundance can reflect differences in the relative abundance of leukocyte populations as well as transcriptional regulation secondary to immune activation (for instance inflammation, interferon, and prostaglandin responses). Quantifying these changes can thus be relevant for making clinical decisions. 1,2 Robust technology platforms, such as microarrays and RNA sequencing, that enable the measurement of transcript abundance in an unbiased fashion (i.e., simultaneously measuring all RNA species that are present in a given sample) have been widely available for the past two decades. [4][5][6][7] In addition, vast amounts of blood transcriptome profiling data have been made available in public repositories such as the NCBI Gene Expression Omnibus or EMBL-EBI's ArrayExpress. 8
Transcriptome profiling data can be leveraged to inform the design of targeted gene panels. These panels can serve as a basis for the development of diagnostic assays for use in clinical settings. But targeted assays can also be employed in research settings, for instance when profiling of transcript abundance needs to be performed on large scales (e.g., in thousands of samples) and with a relatively short turnaround. Notably, targeted assays could also prove valuable in resource-constrained settings, where computing infrastructure, instrument, and reagent costs are limiting. The approaches employed for targeted assay design can be data-driven (e.g., applying computational models to transcriptome profiling dataset(s) to select genes based on their predictive performance) or knowledge-driven (selecting genes based on pre-existing knowledge, e.g., for the development of an "immunology panel"). However, data-driven and knowledge-driven approaches can also be combined. This is illustrated in recently published work in which we describe the selection of three blood transcriptional panels designed for the monitoring of responses to SARS-CoV-2. 9 Transcripts were selected first based on their membership in co-expressed gene sets, the abundance of which was found to change during COVID-19 disease (i.e., through a data-driven approach) and second based on their relevance to one of three themes, which were immunity, therapeutic development, and severe acute respiratory syndrome biology (i.e., through a knowledge-driven approach). However, the amount of information available in the literature and in public transcriptome datasets that can be leveraged for candidate gene selection can be overwhelming. Thus, we have developed an approach to identify, retrieve, structure, and aggregate such information in a manner that would support the rational selection of candidate genes for inclusion in targeted assays destined to be used in clinical or research settings. 10
Here we decided to focus on CEACAM6, a gene encoding a protein of the carcinoembryonic antigen (CEA) family whose members are glycosylphosphatidylinositol (GPI)-linked cell surface proteins. 11,12 The methodology employed in this study is derived from our previously established "collective omics data" (COD) training curriculum, 13 as outlined in our comprehensive methods paper, "A training curriculum for retrieving, structuring, and aggregating information derived from the biomedical literature and large-scale data repositories." 10 This foundational paper provides a detailed description of our systematic approach to information curation, which we have applied in the current investigation of CEACAM6. Specifically, the study utilizes the COD1 training module workflow from this curriculum, which guides the structured retrieval and aggregation of gene-specific data for biomarker assessment. The process encompasses selecting a gene of interest, in this case CEACAM6, to comprehensively gather and synthesize relevant information from both the literature and public datasets, culminating in the creation of resources such as structured data tables and interactive circle packing plots. This approach not only supports the rigorous assessment of CEACAM6's potential as a blood biomarker but also serves as a demonstrative application of our validated methodological framework, providing a practical example of how such a framework can be employed to enhance biomarker discovery efforts.
REVISED Amendments from Version 1
In this revised version of our article, we have clarified the manuscript's focus, emphasizing that the study serves as a proof of concept applying a previously established methodological framework to the CEACAM6 gene. We also address the potential of automation in data curation, discussing our exploration into the use of Large Language Models (LLMs) to enhance efficiency and accuracy. Furthermore, we have updated the discussion around CEACAM6 as a therapeutic target, acknowledging ongoing research and clinical trials that explore its potential, particularly in the context of cancer therapy. These revisions ensure the article accurately reflects current research and methodological advancements, providing a comprehensive and up-to-date overview of the subject. Additionally, we have incorporated a new reference (PMID: 32257432), shedding light on the role of CEACAM6 in individuals with positive fecal immunochemical tests but no intestinal lesions. This further elucidates CEACAM6's availability in circulating neutrophils and enhances our manuscript's precision in representing CEACAM6 as a blood biomarker.
Any further responses from the reviewers can be found at the end of the article
5][16] Associations were also found for pancreatic, lung, and breast cancer, as well as leukemia and inflammatory bowel disease. More in-depth profiling of the literature (analyzing the full text) identified an array of conditions for which CEACAM6 abundance has been found to be significantly different from controls. This list was complemented by a screening of public blood transcriptome datasets. The tables employed to capture this information in a structured format are shared as extended data files. Another deliverable is the interactive circle packing plot that permits aggregation and seamless access to this and all underlying information. Altogether these resources supported manuscript preparation and interpretation/evaluation by the authors of the relevance of CEACAM6 as a biomarker. They may also support transcript selection efforts of members of the research community interested in designing blood transcriptional biomarker panels.

Methods
Overall literature and large-scale dataset profiling approach
The workflow implemented here to assess the potential of CEACAM6 as a blood transcriptional biomarker has been described in detail in a separate methods paper. 10 The approach was devised as part of a training module focused on the development of skills for the retrieval, structuring, aggregation, and interpretation of information derived from the literature and publicly available large-scale profiling datasets. Relevant resources that have been employed and generated in the context of this work are presented in Table 1. Briefly, the process is broken down into the following steps:
(1) Selecting a candidate gene: the most basic criterion is for transcripts for this gene to be detectable in blood. It could also be selected based on its membership in a pre-defined signature or gene set.
(3) Profiling the candidate gene's literature at a high level: the literature associated with the candidate gene is identified (see "literature profiling section" below for details). Entities corresponding to a given theme (e.g., diseases, cell types, or molecular processes) are extracted from the title of those articles ("breast cancer" is an example of a disease entity). This makes it possible to identify the main diseases associated with the gene of interest and, in turn, to identify instances in which the candidate gene has been found to be of actual or potential utility as a biomarker for these diseases.
(4) Profiling the literature in more depth: taking advantage of Google Scholar's full text search capabilities, this step identifies publications where the abundance level of the candidate gene's transcripts in blood samples was found to be different in patients compared with appropriate controls.
(5) Profiling the abundance of the gene across multiple relevant transcriptome datasets: to complement the previous step, public blood transcriptome datasets are screened to identify instances where the abundance level of the candidate gene's transcripts in blood differs in patients in comparison with appropriate controls.
(6) Developing resources supporting manuscript preparation and evaluation of the candidate gene: the information parsed from the literature or transcriptome datasets in earlier steps is recorded in a structured format (e.g., using a standard spreadsheet template, see details below). Using the Prezi web application (Prezi Inc., San Francisco, CA, USA), this information is aggregated in interactive circle packing plots. Spreadsheets and interactive circle plots can next be used to assess the overall relevance of the gene of interest as a candidate blood transcriptional biomarker and support the writing of the manuscript. They can also serve as a resource for investigators interested in designing blood transcriptional biomarker panels.

BloodGen3 blood transcriptional module repertoire
CEACAM6 was selected based on its membership in one of the 382 modules constituting the fixed BloodGen3 module repertoire. This repertoire has been recently characterized. 17 Briefly, it was constructed based on co-expression analysis through a process that was exclusively data-driven. First, the 16 reference blood transcriptome datasets that served as input were clustered separately using K-means clustering. Co-clustering events observed across the 16 reference datasets were then recorded for each gene pair. This information served as a basis for the constitution of a large co-clustering network, with nodes representing genes and edges representing co-clustering events. A weight of 1 to 16 was attributed to the graph edges depending on the number of times co-clustering events were observed. The network was then mined using graph theory to identify densely connected subnetworks that were identified as modules and added to the repertoire. This process eventually yielded 382 non-overlapping modules (at the probe level, multiple probes mapping to the same gene could be found across different modules). Next, the repertoire was thoroughly characterized functionally and an R package was developed to support BloodGen3 module repertoire analysis and visualization.
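
The edge-weighting step described above can be illustrated with a minimal Python sketch. It assumes that each dataset has already been clustered and is represented as a mapping from gene to cluster label; the function name `coclustering_weights`, the toy cluster labels, and the placeholder gene `GENE_X` are hypothetical and are not taken from the BloodGen3 implementation. The K-means clustering itself and the subsequent mining of the network for densely connected subnetworks are not covered.

```python
from collections import Counter
from itertools import combinations


def coclustering_weights(cluster_assignments):
    """Count, for every gene pair, in how many datasets the two genes co-cluster.

    cluster_assignments: one dict per dataset, mapping gene -> cluster label.
    Returns a Counter keyed by alphabetically sorted (gene_a, gene_b) tuples;
    each value is the edge weight (number of co-clustering events).
    """
    weights = Counter()
    for assignment in cluster_assignments:
        # Group the genes of this dataset by their cluster label.
        members = {}
        for gene, label in assignment.items():
            members.setdefault(label, []).append(gene)
        # Every pair of genes sharing a cluster is one co-clustering event.
        for genes in members.values():
            for pair in combinations(sorted(genes), 2):
                weights[pair] += 1
    return weights


# Toy input: three hypothetical datasets (the study used 16) with made-up labels.
toy_assignments = [
    {"CEACAM6": 0, "MPO": 0, "LTF": 0, "GENE_X": 1},
    {"CEACAM6": 2, "MPO": 2, "LTF": 1, "GENE_X": 1},
    {"CEACAM6": 5, "MPO": 5, "LTF": 5, "GENE_X": 3},
]
weights = coclustering_weights(toy_assignments)
```

With 16 input datasets, the resulting weights would range from 1 to 16, matching the edge weights described in the text; pairs that never co-cluster are simply absent from the counter.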

Literature profiling
The approach has been described in two published study guides: from a high-level perspective as part of the COD1 workflow 10 and in more detail in a separate study guide dedicated to literature profiling. 19 An overview of the steps implemented in the profiling of the literature associated with CEACAM6 is provided here:
(1) Literature retrieval: to identify the literature associated with the candidate gene, a PubMed query is designed by combining the official gene name and symbol along with known aliases. Troubleshooting is performed as needed to minimize false positives and false negatives. For CEACAM6 the following query was generated and, as of August
(2) Extraction of relevant concepts: the titles of the articles associated with CEACAM6 are screened for keywords associated with diseases or physiological states and with cell types. For example, if the theme is "diseases or physiological states", disease entities such as "breast cancer", "influenza infection", "pregnancy" or "systemic lupus erythematosus" may be identified in the title of articles associated with the gene of interest.
(3) Generating literature profiles: next, the prevalence of the cell types or disease entities identified in the previous step in the candidate gene's literature is determined. Focusing on a subset of the literature, information regarding the potential relevance of the candidate gene as a biomarker can be captured in a structured format in an Excel spreadsheet.
(4) Aggregating information: the underlying literature profiling information is captured and visually represented in interactive circle packing plots using the Prezi application (Prezi Inc., San Francisco, CA, USA). This serves as a basis for generating manuscript figures and the constitution of a companion resource that can be made accessible to the community.
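
Steps (1) to (3) above can be sketched in Python as follows. This is only an illustration under stated assumptions: the query actually used for CEACAM6 is not reproduced here, the choice of the PubMed `[tiab]` (title/abstract) field tag is an assumption, and the helper names `build_pubmed_query` and `literature_profile` as well as the example titles are invented placeholders rather than retrieved records.

```python
import re
from collections import Counter


def build_pubmed_query(terms):
    """Combine a gene symbol, name, and aliases into one OR query.

    The [tiab] field tag (title/abstract) is an assumed choice; the study's
    own query may have used different fields or exclusions.
    """
    return " OR ".join(f'"{term}"[tiab]' for term in terms)


def literature_profile(titles, entities):
    """Count how many titles mention each entity (case-insensitive, whole phrase)."""
    counts = Counter()
    for entity in entities:
        pattern = re.compile(r"\b" + re.escape(entity) + r"\b", re.IGNORECASE)
        counts[entity] = sum(1 for title in titles if pattern.search(title))
    return counts


# Symbol and official name only; known aliases would be appended here.
query = build_pubmed_query(["CEACAM6", "CEA cell adhesion molecule 6"])

# Invented placeholder titles, used only to show the counting step.
toy_titles = [
    "CEACAM6 expression in breast cancer tissue",
    "A study of CEACAM6 and neutrophils in Crohn's disease",
    "Breast cancer prognosis and CEACAM6",
]
profile = literature_profile(toy_titles, ["breast cancer", "neutrophils", "psoriasis"])
```

Counting articles rather than raw keyword occurrences keeps the profile comparable across entities, since a title that repeats a term is still a single publication.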

Information retrieval and structuring
While screening the literature and large-scale profiling datasets, trainees learn to identify and extract key information from research articles or transcriptome datasets. These include basic information, as well as elements of study design (e.g., analyte name, type, species, biological samples, measurement methods, sample size) and findings (e.g., fold change, significance). The information is captured in a standard MS Excel spreadsheet template, which can be used to record information derived from both the literature and transcriptome profiling datasets (Extended Data File 1 20).
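
As a rough illustration of what one row of such a template might hold, the sketch below models a record in Python and serializes it to CSV. The field names follow the elements listed above, but the exact column layout of Extended Data File 1 is not reproduced; the class name `CuratedFinding`, the extra `source` column, and the placeholder record are assumptions made for the example.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields
from typing import Optional


@dataclass
class CuratedFinding:
    """One curated row; column names are illustrative, not the official template."""
    source: str                      # article or dataset identifier (assumed column)
    analyte_name: str
    analyte_type: str
    species: str
    biological_sample: str
    measurement_method: str
    sample_size: Optional[int]
    fold_change: Optional[float]
    significance: Optional[str]


def to_csv(records):
    """Serialize records to CSV text with a header row; None becomes an empty cell."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=[f.name for f in fields(CuratedFinding)])
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))
    return buffer.getvalue()


# Placeholder record: values not stated in the text are deliberately left empty.
example = CuratedFinding(
    source="placeholder reference",
    analyte_name="CEACAM6",
    analyte_type="transcript",
    species="human",
    biological_sample="whole blood",
    measurement_method="RNA sequencing",
    sample_size=None,
    fold_change=None,
    significance=None,
)
csv_text = to_csv([example])
```

Using one schema for both literature-derived and dataset-derived rows mirrors the single template described above and makes later aggregation straightforward.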

Interactive circle packing plots
Information extracted from the literature and from public transcriptome datasets was aggregated in an interactive circle packing plot generated using the Prezi web application (Prezi Inc., San Francisco, CA, USA). A free basic Prezi account can be set up for this (https://prezi.com/pricing/basic/). Starting from a blank presentation, the process consisted of adding and populating circles (topics) and organizing them into a hierarchy (https://prezi.com/view/pQ7TKEC6tgY3cuik9ckt/). Color-coding the circles and varying their size permitted the visualization of some of the results. Excerpts or full articles were added, as well as plots representing CEACAM6 transcriptional data profiles. Links to articles and interactive versions of the figures were also provided in order to promote seamless access to information.

Transcriptome profiling data analyses and visualization
Screening of transcriptome profiling datasets consisted of determining whether differences between levels of CEACAM6 transcript abundance in patients and their respective controls were significant. The CEACAM6 profiling data were downloaded from the "CD2K" gene expression browser (GXB) instance (http://cd2k.gxbsidra.org/dm3/geneBrowser/list) for multiple blood transcriptome datasets. 21 Analyses were conducted separately for each dataset in Microsoft Excel (RRID:SCR_016137), testing for differences in variance using F-test statistics and testing for differences in expression using t-test statistics. Differences were considered significant when p was <0.05. Plots were generated using Plotly chart studio (RRID:SCR_013991, https://chart-studio.plotly.com/create/).
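
For readers who prefer a scripted equivalent of the spreadsheet analysis, the sketch below shows one way to run the same kind of comparison using only the Python standard library. It is not the authors' Excel workbook. In particular, it assumes that the F-test result decides between the equal-variance (Student) and unequal-variance (Welch) t-test, which the text does not state explicitly; all function names and the toy values are hypothetical.

```python
import math
from statistics import mean, variance


def _betacf(a, b, x, max_iter=300, eps=1e-14, tiny=1e-300):
    """Continued fraction for the incomplete beta function (modified Lentz method)."""
    qab, qap, qam = a + b, a + 1.0, a - 1.0
    c = 1.0
    d = 1.0 - qab * x / qap
    d = 1.0 / (d if abs(d) > tiny else tiny)
    h = d
    for m in range(1, max_iter + 1):
        m2 = 2 * m
        # Even and odd steps of the recurrence share the same update rule.
        for aa in (m * (b - m) * x / ((qam + m2) * (a + m2)),
                   -(a + m) * (qab + m) * x / ((a + m2) * (qap + m2))):
            d = 1.0 + aa * d
            d = 1.0 / (d if abs(d) > tiny else tiny)
            c = 1.0 + aa / c
            c = c if abs(c) > tiny else tiny
            delta = d * c
            h *= delta
        if abs(delta - 1.0) < eps:
            break
    return h


def betainc(a, b, x):
    """Regularized incomplete beta function I_x(a, b)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    front = math.exp(math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
                     + a * math.log(x) + b * math.log(1.0 - x))
    if x < (a + 1.0) / (a + b + 2.0):
        return front * _betacf(a, b, x) / a
    return 1.0 - front * _betacf(b, a, 1.0 - x) / b


def f_test_p(x, y):
    """Two-tailed p-value of the F-test for equality of variances."""
    df1, df2 = len(x) - 1, len(y) - 1
    f = variance(x) / variance(y)
    upper = betainc(df2 / 2.0, df1 / 2.0, df2 / (df2 + df1 * f))  # P(F >= f)
    return 2.0 * min(upper, 1.0 - upper)


def t_test_p(x, y, equal_var):
    """Two-tailed p-value of a two-sample t-test (pooled or Welch)."""
    n1, n2 = len(x), len(y)
    v1, v2 = variance(x), variance(y)
    if equal_var:
        df = n1 + n2 - 2
        pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / df
        se = math.sqrt(pooled * (1.0 / n1 + 1.0 / n2))
    else:
        a, b = v1 / n1, v2 / n2
        df = (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
        se = math.sqrt(a + b)
    t = (mean(x) - mean(y)) / se
    return betainc(df / 2.0, 0.5, df / (df + t * t))


def compare_groups(cases, controls, alpha=0.05):
    """F-test first, then the matching t-test; significant when p < alpha."""
    p_var = f_test_p(cases, controls)
    equal_var = p_var >= alpha  # assumed decision rule, see lead-in
    p = t_test_p(cases, controls, equal_var)
    return {"p_variance": p_var, "equal_var": equal_var, "p": p, "significant": p < alpha}


# Made-up abundance values, for illustration only.
result = compare_groups([10.0, 11.0, 12.0, 13.0, 14.0], [1.0, 2.0, 3.0, 4.0, 5.0])
```

Both tests require at least two observations per group with non-zero variance; a production script would check this before computing the statistics.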

Selection of CEACAM6
The first step consisted of selecting a gene that would next be evaluated for its potential relevance as a blood transcriptional biomarker. CEACAM6 was selected primarily based on its membership in a blood transcriptional signature of interest. This signature is part of a fixed blood transcriptional module repertoire (BloodGen3, see Ref. 17 and methods for details). The M10.4 module signature is functionally associated with neutrophil activation and comprises 11 other genes: BPI, LTF, CEACAM8, DEFA1, DEFA1B, DEFA2, DEFA4, OLFM4, ELANE, CTSG, and MPO (https://prezi.com/view/pQ7TKEC6tgY3cuik9ckt/Step 1: candidate gene selection). In a reference collection of 16 patient cohorts, 17 abundance levels of M10.4 transcripts were the highest in subjects with Staphylococcus aureus infection, respiratory syncytial virus infection and bacterial sepsis (Figure 1).

General background information about CEACAM6
As part of the evaluation process, it can be useful to start by retrieving and synthesizing background information about the candidate gene. For this, summaries from different reference databases, as well as introductions from recent publications on CEACAM6, were retrieved. This information was recorded in the CEACAM6 interactive circle packing plot (https://prezi.com/view/pQ7TKEC6tgY3cuik9ckt/Step 2: gathering background information) and used for development of the narrative below.
CEACAM6 is a glycosylphosphatidylinositol (GPI)-anchored cell surface glycoprotein. It is a member of the carcinoembryonic antigen (CEA) family whose members are known to play a role in cell adhesion. 22 Specifically, CEACAM6 expression has been reported in granulocytes and lung and intestinal epithelial cells. 23 In ileal epithelial cells of patients with Crohn's disease, CEACAM6 has been found to act as a receptor for adherent-invasive Escherichia coli. 24 It has also been found to mediate entry of Neisseria gonorrhoeae. 25 CEA family members are widely used as tumor markers in serum as well as tumor immunoassays. CEACAM6 has been reported to act as an oncogene, promoting tumor progression and metastasis. 26 These properties may, at least in part, be effected via the role of CEACAM6 in promoting anoikis resistance, which prevents the homeostatic elimination of anchorage-dependent cells (such as epithelial cells) that are detached from the cellular matrix. 27 Since CEACAM6 membrane expression is highly specific to tumor cells, it has been suggested as a target for different cancer immunotherapies. 28 It has also recently been identified as an immune checkpoint molecule, based on its role in suppressing cytotoxic T cell responses against malignant plasma cells. 29

Profiling the CEACAM6 literature at a high level reveals an association with neutrophils and several types of cancers
To further our understanding of the biological significance and clinical relevance of CEACAM6, we next sought to systematically screen the literature to identify associations with cell populations and diseases or physiological states.
Altogether, this step established that CEACAM6 is associated with a large body of literature. It also permitted the identification of the main cell types and diseases associated with this gene. This information was used in subsequent literature profiling steps.

CEACAM6 is of potential clinical relevance in the diagnosis of cancers, in particular, the early detection of colorectal carcinoma
The selection of a blood transcriptional panel could take into consideration whether a given candidate gene has already been determined to be of clinical relevance as a biomarker, whether that is at the gene, transcript, or protein level. Thus, we next sought to determine if this was the case for CEACAM6 by extracting relevant information from its literature for the main disease entities identified in the previous steps.
The approach is described in detail in the methods section. In brief, starting from the CEACAM6-associated literature we searched for publications reporting the actual or potential use of CEACAM6 as a biomarker. For this we focused more specifically on the diseases that showed the highest degree of association with CEACAM6 based on the above literature profiling results (i.e., diseases mentioned in more than 20 articles, which are listed in Table 2), namely: leukemia, colorectal, pancreatic, lung, and breast cancers, as well as inflammatory bowel disease. Next, articles associated with CEACAM6 and these diseases that also mentioned "biomarker", "diagnostic", "diagnosis", "prognostic" OR "prognosis" in their title or abstract were retrieved. For articles deemed to be of interest, a standard spreadsheet template was used to capture relevant information (Extended Data File 3 31). Information was also aggregated in an interactive circle packing plot using the Prezi web application (https://prezi.com/view/pQ7TKEC6tgY3cuik9ckt/CEACAM6/Step3: background literature profiling/CEACAM6_Diseases_Biomarker). Together, the information thus gathered served as a basis for the development of the narrative below.
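
The two selection rules described in this paragraph (diseases mentioned in more than 20 articles, then articles mentioning one of five biomarker-related terms in the title or abstract) can be expressed as a short Python sketch. The function names are hypothetical, and the article counts below are placeholders chosen only to exercise the threshold; they are not the values reported in Table 2.

```python
BIOMARKER_TERMS = ("biomarker", "diagnostic", "diagnosis", "prognostic", "prognosis")


def top_diseases(article_counts, min_articles=20):
    """Keep diseases mentioned in more than `min_articles` articles."""
    return sorted(name for name, n in article_counts.items() if n > min_articles)


def mentions_biomarker_term(title, abstract):
    """True if any biomarker-related term occurs in the title or abstract."""
    text = f"{title} {abstract}".lower()
    return any(term in text for term in BIOMARKER_TERMS)


# Placeholder counts, NOT the numbers from Table 2.
placeholder_counts = {"disease A": 45, "disease B": 21, "disease C": 20, "disease D": 3}
selected = top_diseases(placeholder_counts)

# Placeholder title and abstract, invented for the example.
hit = mentions_biomarker_term(
    "CEACAM6 as a prognostic marker",
    "We evaluate tissue expression in a patient cohort.",
)
```

Simple substring matching also captures plural or derived forms such as "biomarkers"; a stricter whole-word match could be substituted if false positives became a concern.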
As mentioned above, CEACAM6 has been noted for its oncogenic properties. Our screening of the CEACAM6 literature, which relates more specifically to its potential relevance as a biomarker in various disease settings, supports this notion. Indeed, a higher abundance of CEACAM6, whether at the transcript or protein level, in tumor tissues or serum was always associated with worse survival (in the case of colorectal, 32,33 breast, 34,35 pancreatic, [36][37][38][39][40] and lung cancers 41). Other studies have found CEACAM6 to be of potential value for differential diagnosis of malignant vs benign tumors for breast cancer (with CEACAM6 protein levels measured in breast tissues 42) and pancreatic cancer (with CEACAM6 protein levels measured in the bile 43). Notably, and of particular relevance to this report, in the case of colorectal carcinoma, measuring the abundance of CEACAM6 at the protein and transcript levels in blood alongside TSPAN8, LGALS4, and COL1A2 has been found to be of potential value for early disease detection. 14,15 Furthermore, CEACAM6 was also recently included in a 10-gene signature predictive model for lung cancer prognosis. 44 Altogether, this review of the literature shows that measurement of CEACAM6, whether at the transcript or protein level, in tumor tissues or in blood, is considered of potential clinical value in informing the management of different types of cancers, as summarized in Table 3.

In-depth screening of the literature shows that blood levels of CEACAM6 transcripts are elevated in a wide range of diseases
We next sought to assess more specifically the relevance of CEACAM6 as a blood transcriptional biomarker. The first pass at screening the literature (above) already identified instances where measuring blood CEACAM6 transcript is deemed of potential clinical value (i.e., for the early detection of colorectal cancer [14][15][16] or the prognosis of lung cancer 44). We wanted to undertake a second pass to profile the literature in more depth to identify additional studies that reported differences in the abundance of CEACAM6 transcripts in blood in patient populations.
Queries were run using Google Scholar, which supports full text search. Entries were screened manually, selecting only peer-reviewed reports where CEACAM6 levels were measured in the blood of human subjects. Relevant information was recorded in a structured format in a spreadsheet using the standard template employed in the previous step. Finally, information was aggregated in the interactive CEACAM6 Prezi circle packing plot.
Differences in CEACAM6 blood transcript levels have been reported in the literature for a wide range of pathologies. Specifically, in addition to the colorectal carcinoma and lung cancer studies described above, it was found to be part of a 13-gene disease signature which was increased in patients with Parkinson's disease as compared with asymptomatic subjects. 45 It was also part of a different 13-gene disease signature that was increased in patients with severe idiopathic pulmonary fibrosis compared with patients with a mild form of the disease. 46 Notably, other members of this latter signature, including CTSG, DEFA3, and OLFM4, are also included in the M10.4 module that is part of the fixed BloodGen3 repertoire mentioned above. Other pathologies and states where blood CEACAM6 transcript levels were found to be increased are summarized in Table 4, and include asthma, 47 sepsis, 48 post-traumatic stress disorder, 49 psoriasis, 50 maternal anti-fetal rejection, 51 and COVID-19. 52,53 It was also found to differ based on gender (higher in males than in females) 54 and notably was also increased by steroid treatment. 55 These latter two findings suggest that in instances where demographics or use of steroids are not well controlled for in the study design, differences in CEACAM6 transcript levels might be, at least in part, attributed to these factors rather than the underlying pathology. For reference, a full record of the information captured from the literature regarding those studies can be found in Extended Data File 4. 56 Additional information is also found aggregated in the CEACAM6 interactive circle packing plot (https://prezi.com/view/pQ7TKEC6tgY3cuik9ckt/CEACAM6/Step3: background literature profiling/CEACAM6_Diseases_Biomarker).
Taken together, this in-depth review of the literature points to differences in CEACAM6 blood transcript abundance being present in patients across a wide range of diseases. This suggests that assays measuring levels of CEACAM6 transcripts in blood may be employed to support biomarker development efforts across different clinical settings.

Screening of public blood transcriptome datasets to identify elevated levels of CEACAM6 in additional disease settings
Literature reports might capture only a fraction of instances where pathophysiological changes are accompanied by changes in the abundance of CEACAM6 blood transcripts. Screening publicly available transcriptome datasets could confirm published reports and help identify other instances where levels of CEACAM6 transcript abundance differ in patients relative to control subjects.
For this, we employed a data browsing web application, the Gene eXpression Browser (GXB), 20 which provides easy access to transcriptional profiles of individual genes in curated collections of transcriptome datasets. For instance, we screened blood transcriptome data for a collection of 16 reference cohorts that were used for the construction of the BloodGen3 repertoire. These datasets are available in the CD2K instance of GXB (http://cd2k.gxbsidra.org/dm3/geneBrowser/list). CEACAM6 transcriptional profiles were retrieved for each of these cohorts and statistics run separately using MS Excel to determine the significance of changes in levels of CEACAM6 transcripts in patients vs controls (Extended Data File 5 57). Changes were captured in a structured format, plotted, and aggregated in the CEACAM6 circle packing plot.
We found differences in levels of CEACAM6 transcript abundance for nine of the 16 reference BloodGen3 datasets (Table 4, Extended Data File 6 58). The pathological or physiological states for which differences were observed did not overlap with those also listed in Table 4 that were identified in the previous step by in-depth screening of the literature. Indeed, we found elevated abundance levels of CEACAM6 in patients with infections caused by Staphylococcus aureus, influenza, respiratory syncytial virus, human immunodeficiency virus, and bacterial pathogens causing sepsis, in comparison with controls (Figure 3). CEACAM6 transcript levels were not increased in patients with tuberculosis. Significant increases were also observed in non-communicable diseases such as systemic onset juvenile arthritis and Kawasaki disease but not in the context of systemic lupus erythematosus, late-stage melanoma, or chronic obstructive pulmonary disease. Finally, we also found a significant increase in abundance in the blood of liver transplant recipients under immunosuppressive therapy and in pregnant women. This transcriptome profiling dataset screen complemented our earlier literature screen, identifying nine additional diseases or physiological states in which CEACAM6 transcript is significantly changed in the blood of patients, for a total of 25 distinct diseases/states which are listed in Table 4. Plots for the nine BloodGen3 datasets are available via the GXB application and have been replotted and loaded to the CEACAM6 circle packing plot (https://prezi.com/view/pQ7TKEC6tgY3cuik9ckt/CEACAM6/Step5: blood tx profiling/CEACAM6_Blood Tx).
Overall, the screening of a reference dataset collection indicated that differences in CEACAM6 levels could be observed in a wide range of conditions in which systemic inflammation is observed. The lack of overlap between the literature and transcriptome data profiling conducted in steps 4 and 5 suggests that expanding this search to a larger number of blood transcriptome datasets would likely significantly add to this list.

To date, no drugs have been developed that target CEACAM6
Another criterion for inclusion of CEACAM6 in a focused assay could be its targeting by approved drugs or drugs currently under development. The "Open Targets" database does not report any known drugs, approved or currently under development, targeting CEACAM6 (https://platform.opentargets.org/target/ENSG00000086548). However, given its recently described role as a suppressor of effector CD8 T-cells, 29 CEACAM6 is currently considered an immune checkpoint molecule and as such could be targeted by drugs designed to block its activity in cancer patients. 28 Additionally, in preclinical mouse models antibodies targeting CEACAM6 have been shown to inhibit tumor growth and metastasis. 26,59

Profiling reference transcriptome datasets shows CEACAM6 transcript expression to be restricted to circulating neutrophils
Finally, screening of reference public transcriptome datasets can also yield insights regarding the candidate gene's regulation and restriction among circulating leukocytes. Thus, in addition to profiling 16 public blood transcriptome datasets, we examined CEACAM6 transcriptional profiles in two other reference datasets. One dataset measured transcript abundance in monocytes, neutrophils, B-cells, CD4+ T-cells, CD8+ T-cells and natural killer (NK) cells and in whole blood (GSE60424 60). The second dataset measured changes in transcript abundance in whole blood exposed in vitro to a wide range of immune stimuli (toll-like receptor agonists, killed bacteria, viruses, inflammatory cytokines and interferons; GSE30101 61). In addition, we screened the Broad Institute's single cell portal 62 for datasets in which CEACAM6 expression was elevated in one or more of the cell clusters. Bulk leukocyte population RNAseq data showed CEACAM6 expression to be restricted to neutrophils (Figure 4) [data source: Linsley et al. 60]. This observation was confirmed in a single-cell dataset in which tumor immune cell infiltrates were dissociated and profiled via RNA sequencing (Figure 5) [data source: He et al. 63]. These findings were in line with the prevalence among the CEACAM6 literature of publications mentioning this cell type (https://prezi.com/view/pQ7TKEC6tgY3cuik9ckt/CEACAM6/Step 3: background literature profiling/CEACAM6_Cell Types) (Figure 2A). However, we did not find CEACAM6 to be increased in whole blood stimulated in vitro (Figure 6) [data source: Obermoser et al. 61]. [66]
Taken together, further profiling of reference transcriptome datasets confirmed the close association of CEACAM6 with neutrophils, which are the most abundant circulating leukocyte population in blood. It also indicates that elevated levels of CEACAM6 transcript abundance observed across a wide range of conditions may be associated with an increase in the relative abundance of cells expressing this gene, rather than regulation of its expression.

Discussion
Clinical translation of biomarker signatures obtained via transcriptome profiling technologies typically involves the development of targeted transcript panels and assays. Such assays can also prove more practical for high-temporal-frequency immunological monitoring applications that require profiling of thousands of samples. They could also be more readily implemented in the context of research projects conducted in low-resource settings. Targeted panel design can be informed by both data-driven and knowledge-driven approaches. However, given the large amounts of data and knowledge available for any given candidate gene, the selection process can prove daunting. Here we employed a workflow devised for screening the literature and large-scale profiling data associated with a given candidate gene, and for retrieving and aggregating relevant information in a structured format. This information and associated resources should in turn support decision-making by investigators aiming to develop targeted panels for downstream clinical or research applications. We focused on CEA cell adhesion molecule 6 (CEACAM6). This candidate is a member of blood transcriptional signatures that are often functionally associated with neutrophil activation, [64][65][66] which typically also include genes encoding constituents of neutrophil granules, such as defensins (DEFA1, DEFA3, DEFA4), myeloperoxidase (MPO), bactericidal permeability increasing protein (BPI), and lactotransferrin (LTF). Several criteria can be used when prioritizing candidate genes for inclusion in a targeted assay, which we have applied here to CEACAM6:
1) Transcripts are detectable in blood and changes can be observed across different immune states/pathologies; this criterion is met in the case of CEACAM6. An increase in levels of CEACAM6 transcripts has been reported in the literature and observed in blood transcriptome datasets for patients with infectious (e.g., bacterial sepsis), autoimmune, or inflammatory diseases (e.g., systemic lupus erythematosus, Kawasaki disease).
2) Previous reports describe the candidate as being of clinical relevance as a biomarker; this criterion is also met. 8][69] CEACAM6 itself is deemed of potential value as a prognostic marker in different types of cancers. 33,34,38,40,41 [16]
3) The functional relevance of the candidate gene in blood leukocytes is known; this criterion is partially met. CEACAM6 is associated with neutrophils in the literature. This was confirmed in our screen of reference transcriptome datasets, both at the bulk leukocyte population and single cell levels (Figures 4 & 5). However, the role played by CEACAM6 in neutrophils has not yet been fully elucidated. For instance, another reference dataset showed that CEACAM6 expression is not regulated in blood exposed in vitro to a wide range of immune stimuli (Figure 6). This finding casts some doubt on whether "neutrophil activation" should be assigned to the signature associated with CEACAM6 (by us and others). These observations may also be consistent with an earlier report that associated a "granulopoiesis signature", which comprised CEACAM6, with low density mononuclear and polymorphonuclear populations found in peripheral blood mononuclear cell fractions. 70 Furthermore, single-cell analyses recently conducted in COVID-19 patients identified a population of "developing neutrophils" that expressed neutrophil granule proteins, including module M10.4 members such as MPO, DEFA3, LTF, and ELANE, and were described as potentially being derived from plasmablasts. 71 Altogether these observations suggest that measuring levels of M10.4 transcripts might permit the monitoring of changes in abundance in this population of developing neutrophils rather than reflecting overall neutrophil abundance. However, this hypothesis and the functional relevance of this subset of neutrophils remain to be validated experimentally.
4) The candidate gene is a target for drugs that are approved or under development. Recent studies and ongoing clinical trials have explored the utility of targeting CEACAM6 in various cancers, particularly through the development of monoclonal antibodies. For instance, preclinical evaluations have demonstrated the potential of CEACAM6 as a therapy target in pancreatic adenocarcinoma, utilizing antibody-drug conjugates to effectively target and diminish CEACAM6-expressing tumors. 72 Additionally, the blocking of CEACAM6-CEACAM1 interactions has shown promise in enhancing T cell-mediated cancer cell elimination, suggesting a role for CEACAM6 in immune modulation and its potential as an immune checkpoint target. 73 The breadth of research, encompassing studies on its prognostic value and therapeutic targeting in cancers, underscores CEACAM6's significance in oncology and its emerging role as a viable therapeutic target. These investigations, reflected in various studies 74,75 and a clinical trial registered under NCT03596372, collectively indicate a growing interest in CEACAM6 as a therapeutic target, warranting further exploration and validation in clinical settings.
Alternate candidates may be found that could be selected instead of CEACAM6 for inclusion in a targeted blood transcriptional assay. CEACAM6 was chosen for this evaluation based on its membership in module M10.4, which is part of the fixed BloodGen3 repertoire. 17 Such module repertoires can be employed as a framework for the design of targeted assays, in which case only one or a few representative transcripts from a given module would usually be selected to provide coverage for the entire repertoire (those modules are formed based on co-expression and all constitutive transcripts would present with a high degree of co-linearity). 9 In the case of module M10.4, other candidates to consider would be CEACAM8, BPI, MPO, LTF, DEFA1, DEFA3, DEFA4, CTSG, OLFM4, and ELANE, since all of those genes belong to the same module as CEACAM6 (Table 5). However, to date, only CEACAM6 has been investigated in depth and thus it is not yet possible to benchmark it against these other candidates. Nevertheless, it can already be noted that BPI (bactericidal/permeability-increasing protein) has been found to be of potential value as a biomarker in patients with asthma, 76 as well as chronic obstructive pulmonary disease. 77 DEFA1 and DEFA3 have been identified as potential inflammatory biomarkers for coronary heart disease. 78 CEACAM8, another member of the carcinoembryonic cell adhesion molecule family, has been found to be of potential value as a prognostic marker in patients with esophageal cancer and in patients with sepsis. 79,80
Finally, it is worth highlighting some of the limitations of our investigation into the relevance of CEACAM6 as a blood transcriptome biomarker. For instance, it should be noted that the screen conducted among public transcriptome data is not comprehensive. Additional blood transcriptome datasets are available in GEO and other repositories that have not yet been loaded in GXB instances. As a result, the list of conditions in which CEACAM6 blood transcript abundance changes is probably conservative and will likely grow as more datasets become available for screening. The current methodology's reliance on a systematic, manual approach to data retrieval and structuring is another limitation. We recognize the potential of automation to transform this labor-intensive process. In this respect, we are actively exploring the integration of Large Language Models (LLMs) into our data curation workflow. These advanced models show promise in streamlining the identification, extraction, and structuring of relevant information, potentially mitigating the challenges associated with the sheer volume and dynamic nature of biomedical databases. Our preliminary explorations suggest that while LLMs may not fully replace the nuanced judgment of human curators, they offer significant support by enhancing efficiency and accuracy, thereby complementing our existing methodologies. Thus, we are cautiously optimistic about the role of LLMs in enhancing our data analysis framework, aiming to improve efficiency while maintaining accuracy. This integration of LLMs is an ongoing effort and will be detailed further in upcoming publications.
In conclusion, the information presented here should help researchers decide whether to include CEACAM6 in the targeted assay they intend to develop. Some of our findings suggest that measuring abundance of CEACAM6 transcripts in blood could prove to be of value in the monitoring and management of patients with diseases associated with systemic inflammation. This would likely be true for other members of the BloodGen3 module M10.4/"neutrophil activation" gene sets. However, CEACAM6 presents with the distinct advantage of also being of potential value in the management of patients with cancer, whether the assay would be used to measure transcript abundance in blood or in tumor tissues.

This article offers comprehensive and detailed information on the process of screening a protein as a potential biomarker. It is assumed that this method can be applied to any protein for screening, assessment, and correlation with various diseases.
While the paper provides methods for screening a biomarker, it lacks clarity regarding its purpose.
It is unclear whether the objective is to explain the screening methods, or to demonstrate CEACAM6 as a potential biomarker, or both.
If it is both: The authors are advised to present the flowchart of the methodology section in detail.
1. The paper lacks an explanation of CEACAM6 as a protein and its physiological functions/pathways in the context of diseases, since the title claims to reveal the potential relevance of CEACAM6.
2. The paper is expected to balance the methodology of biomarker screening with the functions and potential of CEACAM6, both as a protein and as a biomarker.

If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes

Are the conclusions drawn adequately supported by the results? Yes
Competing Interests: No competing interests were disclosed.

Reviewer Expertise: Rare Disease Genetics, Cancer Biology
We confirm that we have read this submission and believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however we have significant reservations, as outlined above.

If applicable, is the statistical analysis and its interpretation appropriate? Yes
Are all the source data underlying the results available to ensure full reproducibility? Yes
Are the conclusions drawn adequately supported by the results?
and there are also clinical studies in cancer where this is being targeted as an immune checkpoint.
This study assesses changes in expression of any gene in disease cases to qualify it as a marker, but equally important is what approaches have been used or what indications exist that it has been studied as a therapeutic target and, if so, the outcome of such study. There are several publications for studies of CEACAM6 as a therapeutic target in cancer. Examples of a few of the published works from NCBI PubMed: PMID: 19334050, PMID: 35141051, PMID: 31797958, PMID: 35082925, and a clinical trial: https://clinicaltrials.gov/study/NCT03596372. These should be cited, as similar approaches could be utilized for other diseases.

Reviewer Expertise: Cancer research
We confirm that we have read this submission and believe that we have an appropriate level of expertise to confirm that it is of an acceptable scientific standard, however we have significant reservations, as outlined above.
Author Response 08 Mar 2024

Damien Chaussabel
Thank you for your constructive review and insightful comments. We understand the concerns raised regarding the clarity of our manuscript's focus and the methodologies employed for data retrieval and analysis. Allow us to address these points with additional context that we believe will clarify our intentions and the contributions of our work.

Clarification of Manuscript Focus:
Our current work is intended as a proof of concept paper, demonstrating the practical application of the methodological framework we detailed in a previously published paper ("A training curriculum for retrieving, structuring, and aggregating information derived from the biomedical literature and large-scale data repositories"). This earlier publication provides an in-depth description of the methods and serves as the foundation upon which our current study on CEACAM6 is built. We acknowledge that this may not have been made sufficiently clear in our manuscript, leading to confusion about its primary focus. We revised our introduction sections to explicitly state that this work is a demonstration of our previously published methodology applied to the CEACAM6 gene, rather than an exposition of new methodological advancements: "The methodology employed in this study is derived from our previously established "collective omics data" (COD) training curriculum, as outlined in our comprehensive methods paper, "A training curriculum for retrieving, structuring, and aggregating information derived from the biomedical literature and large-scale data repositories." 10 This foundational paper provides a detailed description of our systematic approach to information curation, which we have applied in the current investigation of CEACAM6. Specifically, the study utilizes the COD1 training module workflow from this curriculum, which guides the structured retrieval and aggregation of gene-specific data for biomarker assessment. The process encompasses selecting a gene of interest, in this case, CEACAM6, to comprehensively gather and synthesize relevant information from both literature and public datasets, culminating in the creation of resources like structured data tables and interactive circle packing plots. This approach not only supports the rigorous assessment of CEACAM6's potential as a blood biomarker but also serves as a demonstrative application of our validated methodological framework, providing a practical example of how such a framework can be employed to enhance biomarker discovery efforts."

Data Retrieval and Analysis:
The method described in our prior paper involves a systematic but manual approach to retrieving, structuring, and aggregating information, which, as you rightly pointed out, can be labor-intensive. We recognize the importance of automation in managing the vast amount of data available in biomedical research. While our published method does not incorporate automated processes, we have, in response to the evolving needs of data curation, begun exploring the potential of Large Language Models (LLMs) to assist in manual data curation tasks. This effort is led by staff who have recently joined our group and represents an exciting direction for enhancing the efficiency and scalability of our data curation processes. Preliminary findings suggest that while LLMs are not a complete substitute for manual curation, they can significantly aid in the process by streamlining the identification and extraction of relevant information. This ongoing work acknowledges the pertinence of your feedback regarding automation and highlights our commitment to advancing our methodologies in line with technological developments. Although the results of these explorations are not included in the current manuscript, they are part of a separate study that we plan to publish in the future. This will detail our experiences and findings regarding the integration of LLMs into our data curation workflow, providing insights that could benefit the broader research community in handling similar challenges. A new paragraph has been added to the discussion, acknowledging current limitations and potential strategies for automating the information extraction workflow: "The current methodology reliance on a systematic, manual approach to data retrieval and structuring is another limitation. We recognize the potential of automation to transform this labor-intensive process. In this respect, we are actively exploring the integration of Large Language Models (LLMs) into our data curation workflow. These advanced models show promise in streamlining the identification, extraction, and structuring of relevant information, potentially mitigating the challenges associated with the sheer volume and dynamic nature of biomedical databases. Our preliminary explorations suggest that while LLMs may not fully replace the nuanced judgment of human curators, they offer significant support by enhancing efficiency and accuracy, thereby complementing our existing methodologies. Thus we are cautiously optimistic about the role of LLMs in enhancing our data analysis framework, aiming to improve efficiency while maintaining accuracy. This integration of LLMs is an ongoing effort and will be detailed further in upcoming publications."

Discussion on Therapeutic Targeting of CEACAM6:
We appreciate your pointing out the need to correct and expand our discussion on therapeutic efforts targeting CEACAM6. It was not our intention to overlook significant research in this area. We have revised the relevant sections to accurately reflect ongoing and completed studies targeting CEACAM6 with therapeutic intent, citing the publications and clinical trials you mentioned. This will ensure our discussion acknowledges both the biomarker potential of CEACAM6 and its implications for therapeutic development: Recent studies and ongoing clinical trials have explored the utility of targeting CEACAM6 in various cancers, particularly through the development of monoclonal antibodies. For instance, preclinical evaluations have demonstrated the potential of CEACAM6 as a therapy target in pancreatic adenocarcinoma, utilizing antibody-drug conjugates to effectively target and diminish CEACAM6-expressing tumors (PMID: 19334050). Additionally, the blocking of CEACAM6-CEACAM1 interactions has shown promise in enhancing T cell-mediated cancer cell elimination, suggesting a role for CEACAM6 in immune modulation and its potential as an immune checkpoint target (PMID: 35141051). The breadth of research, encompassing studies on its prognostic value and therapeutic targeting in cancers, underscores CEACAM6's significance in oncology and its emerging role as a viable therapeutic target. These investigations, reflected in various studies (PMID: 31797958, PMID: 35082925) and a clinical trial registered under NCT03596372, collectively indicate a growing interest in CEACAM6 as a therapeutic target, warranting further exploration and validation in clinical settings.

Figure 1. Differences in transcript abundance levels for BloodGen3 module M10.4 across 16 reference datasets. A. The module fingerprint heatmap represents the proportion of transcripts for a given module (rows) for which abundance levels are significantly different in case subjects compared to the respective controls for a given reference dataset (columns). Values can range from +100% (solid red: abundance for all constitutive transcripts for the module is significantly higher) to -100% (solid blue: abundance for all constitutive transcripts for the module is significantly lower). Responses are shown for four modules included in the module aggregate A38 from the BloodGen3 repertoire, 17 including module M10.4 from which CEACAM6 was selected. B. The box plot represents the percentage response averaged for module M10.4 across the 16 reference datasets (we have contributed this dataset collection to GEO as part of an earlier work, 17 and it is accessible under accession number GSE100150). Plots were generated using the BloodGen3 web application: https://drinchai.shinyapps.io/BloodGen3Module/.
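The percentage response described in this caption can be illustrated with a minimal Python sketch. It assumes that each transcript in a module has already been called as significantly higher (+1), significantly lower (-1), or unchanged (0) in cases versus controls, and that the module value is the net percentage (percent higher minus percent lower). Both the function name `module_percent_response` and the net-percentage convention are assumptions made for illustration; the BloodGen3 web application may compute and display these values differently.

```python
from typing import Iterable


def module_percent_response(calls: Iterable[int]) -> float:
    """Net percentage of module transcripts that differ in cases vs controls.

    `calls` holds one value per transcript in the module (hypothetical input
    format): +1 = significantly higher, -1 = significantly lower,
    0 = no significant difference.
    Assumption: the module value is %higher minus %lower, so it ranges from
    +100 (all transcripts higher) to -100 (all transcripts lower).
    """
    calls = list(calls)
    if not calls:
        raise ValueError("module has no transcripts")
    n_up = sum(1 for c in calls if c > 0)
    n_down = sum(1 for c in calls if c < 0)
    return 100.0 * (n_up - n_down) / len(calls)


if __name__ == "__main__":
    # Toy module of four transcripts: two higher, one unchanged, one lower.
    print(module_percent_response([1, 1, 0, -1]))
```

Under this convention, a module in which every transcript is significantly higher yields +100, and one in which every transcript is significantly lower yields -100, matching the range given in the caption.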

Figure 2. CEACAM6 disease and cell type literature profiles. The prevalence of articles among the literature associated with CEACAM6 for disease entities (A) or cell type entities (B) is represented by circles of different sizes and colors, corresponding to the number of associated articles. It is possible to access underlying information by zooming into each of the circles. The Prezi presentation can be accessed at this url: https://prezi.com/view/pQ7TKEC6tgY3cuik9ckt/ Step 3: background literature profiling.

Figure 4. CEACAM6 restriction among circulating leukocyte populations. This box plot shows levels of abundance of CEACAM6 RNA measured by RNA sequencing in neutrophils, monocytes, B-cells, CD4+ T-cells, CD8+ T-cells and NK cells purified from the blood of human subjects, including patients with ALS, type 1 diabetes, multiple sclerosis (immediately before and 24 hours after initiation of beta interferon therapy) or sepsis and healthy controls. Values are normalized to the median calculated across all conditions. For details, see original work by Linsley et al. 60 GEO deposition: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE60424; plot: https://plotly.com/~dchaussabel/171/.

Figure 5. CEACAM6 expression at the single-cell level among dissociated prostate tumor tissue cells.This tSNE plot shows abundance levels of CEACAM6 measured by single-cell RNA sequencing among dissociated metastatic prostate tumor tissue cells.After quality control, this set consisted of 2,170 cells obtained from 14 patients and 15 biopsies.Clusters are labelled for dominant cell type based on marker gene expression on the plot above.Normalized transcript per million (TPM) counts for CEACAM6 are shown in blue on the plot below.For details, see original work by He et al. 63 An interactive version of this plot is accessible via the Broad Institute single cell portal: https://singlecell.broadinstitute.org/single_cell/study/SCP1244/transcriptional-mediators-of-treatment-resistancein-lethal-prostate-cancer?genes=CEACAM6#study-visualize.

Table 1. List of online resources employed for profiling CEACAM6 literature/transcriptional data, including those generated as part of the present work. 18

Table 2. List of the most prevalent diseases/physiological states and cell types found among the CEACAM6 literature.

Table 3. Published reports describing CEACAM6 as being of clinical relevance as a biomarker.

Table 4. Pathological, immunological, or physiological states where CEACAM6 transcript abundance levels have been found to differ in cases vs controls.

Table 5. Published gene signatures comprising CEACAM6. This table lists targeted gene sets or gene panels comprising CEACAM6. Lists of differentially expressed genes that consist of tens or hundreds of transcripts are purposely omitted.