Semantic Scholar Research

We're a team of researchers and engineers with diverse backgrounds collaborating to solve some of the toughest problems in AI research.

Our mission at Semantic Scholar is to accelerate scientific breakthroughs by helping scholars overcome information overload.

Our team of world-class AI research scientists is dedicated to studying information overload and developing groundbreaking AI and tools to overcome it. Active areas of AI research include Natural Language Processing, Machine Learning, Human Computer Interaction, and Information Retrieval.

We are a part of the Allen Institute for AI, a nonprofit research institute founded by Microsoft co-founder Paul Allen to develop AI that benefits the common good.

Learn more about our research below, and contact us if you’re interested in collaborating, learning more about our models, or joining our team.

Semantic Scholar Papers

  • Abductive Commonsense Reasoning
    ICLR 2020
    Chandra Bhagavatula, Ronan Le Bras, Chaitanya Malaviya, Keisuke Sakaguchi, Ari Holtzman, Hannah Rashkin, Doug Downey, Scott Wen-tau Yih, Yejin Choi
    Abductive reasoning is inference to the most plausible explanation. For example, if Jenny finds her house in a mess when she returns from work, and remembers that she left a window open, she can hypothesize that a thief broke into her house and caused the mess, as the most plausible explanation… (More)

  • SciBERT: A Pretrained Language Model for Scientific Text
    EMNLP 2019
    Iz Beltagy, Kyle Lo, Arman Cohan
    Obtaining large-scale annotated data for NLP tasks in the scientific domain is challenging and expensive. We release SciBERT, a pretrained language model based on BERT (Devlin et al., 2018) to address the lack of high-quality, large-scale labeled scientific data. SciBERT leverages unsupervised… (More)

  • Pretrained Language Models for Sequential Sentence Classification
    EMNLP 2019
    Arman Cohan, Iz Beltagy, Daniel King, Bhavana Dalvi, Daniel S. Weld
    As a step toward better document-level understanding, we explore classification of a sequence of sentences into their corresponding categories, a task that requires understanding sentences in context of the document. Recent successful models for this task have used hierarchical models to… (More)

  • SpanBERT: Improving Pre-training by Representing and Predicting Spans
    EMNLP 2019
    Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. Weld, Luke Zettlemoyer, Omer Levy
    We present SpanBERT, a pre-training method that is designed to better represent and predict spans of text. Our approach extends BERT by (1) masking contiguous random spans, rather than random tokens, and (2) training the span boundary representations to predict the entire content of the masked span…  (More)

  • BERT for Coreference Resolution: Baselines and Analysis
    EMNLP 2019
    Mandar Joshi, Omer Levy, Daniel S. Weld, Luke Zettlemoyer
    We apply BERT to coreference resolution, achieving strong improvements on the OntoNotes (+3.9 F1) and GAP (+11.5 F1) benchmarks. A qualitative analysis of model predictions indicates that, compared to ELMo and BERT-base, BERT-large is particularly better at distinguishing between related but…  (More)

  • GrapAL: Connecting the Dots in Scientific Literature
    ACL 2019
    Christine Betts, Joanna Power, Waleed Ammar
    We introduce GrapAL (Graph database of Academic Literature), a versatile tool for exploring and investigating a knowledge base of scientific literature, that was semi-automatically constructed using NLP methods. GrapAL satisfies a variety of use cases and information needs requested by researchers…  (More)

  • ScispaCy: Fast and Robust Models for Biomedical Natural Language Processing
    ACL • BioNLP Workshop 2019
    Mark Neumann, Daniel King, Iz Beltagy, Waleed Ammar
    Despite recent advances in natural language processing, many statistical models for processing text perform extremely poorly under domain shift. Processing biomedical and clinical text is a critically important application area of natural language processing, for which there are few robust…  (More)

  • Ontology-Aware Clinical Abstractive Summarization
    SIGIR 2019
    Sean MacAvaney, Sajad Sotudeh, Arman Cohan, Nazli Goharian, Ish Talati, Ross W. Filice
    Automatically generating accurate summaries from clinical reports could save a clinician's time, improve summary coverage, and reduce errors. We propose a sequence-to-sequence abstractive summarization model augmented with domain-specific ontological information to enhance content selection and… (More)

  • CEDR: Contextualized Embeddings for Document Ranking
    SIGIR 2019
    Sean MacAvaney, Andrew Yates, Arman Cohan, Nazli Goharian
    Although considerable attention has been given to neural ranking architectures recently, far less attention has been paid to the term representations that are used as input to these models. In this work, we investigate how two pretrained contextualized language models (ELMo and BERT) can be…  (More)

  • Quantifying Sex Bias in Clinical Studies at Scale With Automated Data Extraction
    JAMA 2019
    Sergey Feldman, Waleed Ammar, Kyle Lo, Elly Trepman, Madeleine van Zuylen, Oren Etzioni
    Importance: Analyses of female representation in clinical studies have been limited in scope and scale. Objective: To perform a large-scale analysis of global enrollment sex bias in clinical studies. Design, Setting, and Participants: In this cross-sectional study, clinical studies from published…  (More)

  • Gender trends in computer science authorship
    Lucy Lu Wang, Gabriel Stanovsky, Luca Weihs, Oren Etzioni
    A comprehensive and up-to-date analysis of Computer Science literature (2.87 million papers through 2018) reveals that, if current trends continue, parity between the number of male and female authors will not be reached in this century. Under our most optimistic projection models, gender parity is…  (More)

  • Structural Scaffolds for Citation Intent Classification in Scientific Publications
    NAACL 2019
    Arman Cohan, Waleed Ammar, Madeleine van Zuylen, Field Cady
    Identifying the intent of a citation in scientific papers (e.g., background information, use of methods, comparing results) is critical for machine reading of individual publications and automated analysis of the scientific literature. We propose a multitask approach to incorporate information in…  (More)

  • Combining Distant and Direct Supervision for Neural Relation Extraction
    NAACL 2019
    Iz Beltagy, Kyle Lo, Waleed Ammar
    In relation extraction with distant supervision, noisy labels make it difficult to train quality models. Previous neural models addressed this problem using an attention mechanism that attends to sentences that are likely to express the relations. We improve such models by combining the distant…  (More)

  • Citation Count Analysis for Papers with Preprints
    Sergey Feldman, Kyle Lo, Waleed Ammar
    We explore the degree to which papers prepublished on arXiv garner more citations, in an attempt to paint a sharper picture of fairness issues related to prepublishing. A paper’s citation count is estimated using a negative-binomial generalized linear model (GLM) while observing a binary variable…  (More)

  • Construction of the Literature Graph in Semantic Scholar
    NAACL-HLT 2018
    Waleed Ammar, Dirk Groeneveld, Chandra Bhagavatula, Iz Beltagy, Miles Crawford, Doug Downey, Jason Dunkelberger, Ahmed Elgohary, Sergey Feldman, Vu Ha, Rodney Kinney, Sebastian Kohlmeier, Kyle Lo, Tyler Murray, Hsu-Han Ooi, Matthew E. Peters, et al.
    We describe a deployed scalable system for organizing published scientific literature into a heterogeneous graph to facilitate algorithmic manipulation and discovery. The resulting literature graph consists of more than 280M nodes, representing papers, authors, entities and various interactions…  (More)

  • Content-Based Citation Recommendation
    NAACL-HLT 2018
    Chandra Bhagavatula, Sergey Feldman, Russell Power, Waleed Ammar
    We present a content-based method for recommending citations in an academic paper draft. We embed a given query document into a vector space, then use its nearest neighbors as candidates, and rerank the candidates using a discriminative model trained to distinguish between observed and unobserved…  (More)

  • A Dataset of Peer Reviews (PeerRead): Collection, Insights and NLP Applications
    NAACL-HLT 2018 Dataset
    Dongyeop Kang, Waleed Ammar, Bhavana Dalvi Mishra, Madeleine van Zuylen, Sebastian Kohlmeier, Eduard Hovy, Roy Schwartz
    Peer reviewing is a central component in the scientific publishing process. We present the first public dataset of scientific peer reviews available for research purposes (PeerRead v1), providing an opportunity to study this important artifact. The dataset consists of 14.7K paper drafts and the…  (More)

  • Extracting Scientific Figures with Distantly Supervised Neural Networks
    JCDL 2018
    Noah Siegel, Nicholas Lourie, Russell Power and Waleed Ammar
    Non-textual components such as charts, diagrams and tables provide key information in many scientific documents, but the lack of large labeled datasets has impeded the development of data-driven methods for scientific figure extraction. In this paper, we induce high-quality training labels for the…  (More)

  • Ontology Alignment in the Biomedical Domain Using Entity Definitions and Context
    ACL • Proceedings of the BioNLP 2018 Workshop
    Lucy L. Wang, Chandra Bhagavatula, M. Neumann, Kyle Lo, Chris Wilhelm, Waleed Ammar
    Ontology alignment is the task of identifying semantically equivalent entities from two given ontologies. Different ontologies have different representations of the same entity, resulting in a need to de-duplicate entities when merging ontologies. We propose a method for enriching entities in an…  (More)

  • Semi-supervised sequence tagging with bidirectional language models
    ACL 2017
    Matthew E. Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power
    Pre-trained word embeddings learned from unlabeled text have become a standard component of neural network architectures for NLP tasks. However, in most cases, the recurrent network that operates on word-level representations to produce context sensitive representations is trained on relatively…  (More)

  • Explicit Semantic Ranking for Academic Search via Knowledge Graph Embedding
    WWW 2017
    Chenyan Xiong, Russell Power and Jamie Callan
    This paper introduces Explicit Semantic Ranking (ESR), a new ranking technique that leverages knowledge graph embedding. Analysis of the query log from our academic search engine reveals that a major error source is its inability to understand the meaning of research concepts…  (More)

  • The AI2 system at SemEval-2017 Task 10 (ScienceIE): semi-supervised end-to-end entity and relation extraction
    SemEval 2017
    Waleed Ammar, Matthew E. Peters, Chandra Bhagavatula, and Russell Power
    This paper describes our submission for the ScienceIE shared task (SemEval-2017 Task 10) on entity and relation extraction from scientific papers. Our model is based on the end-to-end relation extraction model of Miwa and Bansal (2016) with several enhancements such as semi-supervised learning via… (More)

  • Learning to Predict Citation-Based Impact Measures
    JCDL 2017
    Luca Weihs and Oren Etzioni
    Citations implicitly encode a community's judgment of a paper's importance and thus provide a unique signal by which to study scientific impact. Efforts in understanding and refining this signal are reflected in the probabilistic modeling of citation networks and the proliferation of citation-based…  (More)

  • End-to-End Neural Ad-hoc Ranking with Kernel Pooling
    SIGIR 2017
    Chenyan Xiong, Zhuyun Dai, Jamie Callan, Zhiyuan Liu, and Russell Power
    This paper proposes K-NRM, a kernel based neural model for document ranking. Given a query and a set of documents, K-NRM uses a translation matrix that models word-level similarities via word embeddings, a new kernel-pooling technique that uses kernels to extract multi-level soft match features… (More)

  • AI zooms in on highly influential citations
    Nature 2017
    Oren Etzioni
    The number of times a paper is cited is a poor proxy for its impact (see P. Stephan et al. Nature 544, 411–412; 2017). I suggest relying instead on a new metric that uses artificial intelligence (AI) to capture the subset of an author's or a paper's essential and therefore most highly influential…  (More)

  • Ontology Aware Token Embeddings for Prepositional Phrase Attachment
    ACL 2017
    Pradeep Dasigi, Waleed Ammar, Chris Dyer, and Eduard Hovy
    Type-level word embeddings use the same set of parameters to represent all instances of a word regardless of its context, ignoring the inherent lexical ambiguity in language. Instead, we embed semantic concepts (or synsets) as defined in WordNet and represent a word token in a particular context by…  (More)

  • Toward Automatic Bootstrapping of Online Communities Using Decision-theoretic Optimization
    CSCW 2016
    Shih-Wen Huang, Jonathan Bragg, Isaac Cowhey, Oren Etzioni, and Daniel S. Weld
    Successful online communities (e.g., Wikipedia, Yelp, and StackOverflow) can produce valuable content. However, many communities fail in their initial stages. Starting an online community is challenging because there is not enough content to attract a critical mass of active members. This paper…  (More)

  • PDFFigures 2.0: Mining Figures from Research Papers
    JCDL 2016
    Christopher Clark and Santosh Divvala
    Figures and tables are key sources of information in many scholarly documents. However, current academic search engines do not make use of figures and tables when semantically parsing documents or presenting document summaries to users. To facilitate these applications we develop an algorithm that…  (More)

  • Looking Beyond Text: Extracting Figures, Tables and Captions from Computer Science Papers
    AAAI • Workshop on Scholarly Big Data 2015
    Christopher Clark and Santosh Divvala
    Identifying and extracting figures and tables along with their captions from scholarly articles is important both as a way of providing tools for article summarization, and as part of larger systems that seek to gain deeper, semantic understanding of these articles. While many "off-the-shelf" tools…  (More)

  • Identifying Meaningful Citations
    AAAI • Workshop on Scholarly Big Data 2015
    Marco Valenzuela, Vu Ha, and Oren Etzioni
    We introduce the novel task of identifying important citations in scholarly literature, i.e., citations that indicate that the cited work is used or extended in the new effort. We believe this task is a crucial component in algorithms that detect and follow research topics and in methods that…  (More)

Contact Semantic Scholar Research