Apr 2025

Volume 41, Issue 4, p261-358, e1-e2
Large language models trained on DNA sequences, also known as genomic language models (gLMs), hold significant potential to advance our understanding of genomes and the interactions between DNA elements that drive complex functions. In this issue, Benegas et al. review key opportunities and challenges for gLMs, outlining important considerations for their development and evaluation to benefit the genomics community. In this image, the two binary strings correspond to reverse-complementary DNA sequences (00 = A, 01 = C, 10 = G, and 11 = T). The connecting rectangles represent “embeddings” learned by gLMs. Illustration by Yun S. Song.
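The 2-bit encoding in the cover caption can be sketched in a few lines of Python. Under the stated mapping (00 = A, 01 = C, 10 = G, 11 = T), complementary bases differ in both bits, so the reverse complement of a sequence is obtained by flipping every bit and reversing the order of the 2-bit pairs. The function names below are illustrative assumptions and do not come from the issue.

```python
# 2-bit encoding from the cover caption: 00 = A, 01 = C, 10 = G, 11 = T.
# All names here are hypothetical helpers, chosen for illustration only.
ENCODE = {"A": "00", "C": "01", "G": "10", "T": "11"}
DECODE = {bits: base for base, bits in ENCODE.items()}


def encode(seq: str) -> str:
    """Turn a DNA string into its binary string, two bits per base."""
    return "".join(ENCODE[base] for base in seq.upper())


def decode(bits: str) -> str:
    """Turn a binary string back into DNA, reading two bits at a time."""
    return "".join(DECODE[bits[i:i + 2]] for i in range(0, len(bits), 2))


def reverse_complement_bits(bits: str) -> str:
    """Reverse-complement directly on the binary string.

    Flipping both bits of a base gives its complement (A <-> T, C <-> G);
    reversing the order of the 2-bit pairs then gives the reverse complement.
    """
    flipped = "".join("1" if b == "0" else "0" for b in bits)
    pairs = [flipped[i:i + 2] for i in range(0, len(flipped), 2)]
    return "".join(reversed(pairs))


if __name__ == "__main__":
    forward = encode("AACG")
    print(forward, reverse_complement_bits(forward))
```

Decoding the second printed string with `decode` returns the reverse-complementary DNA sequence of the first.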

Science & Society

  • We must not ignore eugenics in our genetics curriculum

    • Mark Peifer
    Eugenics, which promoted planned breeding to ensure 'racial improvement', was central to the development of genetics and led to horrifying policies. However, eugenics is not dead and continues to influence science and policy today. Thus, we should include eugenics in our undergraduate classes to remind students that scientists must speak out when others lie about science and use it to further their political views.

Spotlights

  • Genetic buffering mechanisms in SNF2-family translocases

    • Sumedha Agashe,
    • Alessandro Vindigni
    Open Access
    SNF2-family DNA translocases, a large family of ATPases, have poorly defined roles in genomic stability. In a recent study, Feng et al. identified a synthetic lethal interaction between the SNF2 translocase SMARCAL1 and Fanconi anemia (FA) group M (FANCM), revealing a new genetic buffering mechanism that maintains genome stability by aiding DNA replication at loci enriched in simple repeats.
  • A more elaborate genetic clock for clonal species

    • Jinhee Ryu,
    • Yeonjin Kim,
    • Young Seok Ju
    The genetic clock is a well-established tool used in evolutionary biology for estimating divergence times between species, individuals, or cells based on DNA sequence changes. Yu et al. have revisited the clock to make it applicable to clonal multicellular organisms that expand through asexual reproduction mechanisms, enabling more comprehensive evolutionary tracking.

Forum

  • Leveraging spatial multiomics to unravel tissue architecture in embryo development

    • Fuqing Jiang,
    • Haoxian Wang,
    • Zhuxia Li,
    • Guizhong Cui,
    • Guangdun Peng
    Spatial multiomics technologies have revolutionized biomedical research by enabling the simultaneous measurement of multiple omics modalities within intact tissue sections. This approach facilitates the reconstruction of 3D molecular architectures, providing unprecedented insights into complex cellular interactions and the intricate organization of biological systems, such as those underlying embryonic development.

Opinion

  • Q-rich activation domains: flexible ‘rulers’ for transcription start site selection?

    • Andrea Bernardini,
    • Roberto Mantovani
    Open Access
    Recent findings broadened the function of RNA polymerase II (Pol II) proximal promoter motifs from quantitative regulators of transcription to important determinants of transcription start site (TSS) position. These motifs are recognized by transcription factors (TFs) that we propose to term ‘ruler’ TFs (rTFs), such as NRF1, NF-Y, YY1, ZNF143, BANP, and members of the SP, ETS, and CRE families, sharing as a common feature a glutamine-rich (Q-rich) effector domain also enriched in valine, isoleucine, and threonine (QVIT-rich). We propose that rTFs guide TSS location by constraining the position of the pre-initiation complex (PIC) during its promoter recognition phase through a specialized, and still enigmatic, class of activation domains.

Reviews

  • Genomic language models: opportunities and challenges

    • Gonzalo Benegas,
    • Chengzhong Ye,
    • Carlos Albors,
    • Jianan Canal Li,
    • Yun S. Song
    Large language models (LLMs) are having transformative impacts across a wide range of scientific fields, particularly in the biomedical sciences. Just as the goal of natural language processing is to understand sequences of words, a major objective in biology is to understand biological sequences. Genomic language models (gLMs), which are LLMs trained on DNA sequences, have the potential to significantly advance our understanding of genomes and how DNA elements at various scales interact to give rise to complex functions. To showcase this potential, we highlight key applications of gLMs, including functional constraint prediction, sequence design, and transfer learning. Despite notable recent progress, however, developing effective and efficient gLMs presents numerous challenges, especially for species with large, complex genomes. Here, we discuss major considerations for developing and evaluating gLMs.
  • The long and short of hyperdivergent regions

    • Nicolas D. Moya,
    • Stephanie M. Yan,
    • Rajiv C. McCoy,
    • Erik C. Andersen
    The increasing prevalence of genome sequencing and assembly has uncovered evidence of hyperdivergent genomic regions – loci with excess genetic diversity – in species across the tree of life. Hyperdivergent regions are often enriched for genes that mediate environmental responses, such as immunity, parasitism, and sensory perception. Especially in self-fertilizing species where the majority of the genome is homozygous, the existence of hyperdivergent regions might imply the historical action of evolutionary forces such as introgression and/or balancing selection. We anticipate that the application of new sequencing technologies, broader taxonomic sampling, and evolutionary modeling of hyperdivergent regions will provide insights into the mechanisms that generate and maintain genetic diversity within and between species.
  • Keeping it safe: control of meiotic chromosome breakage

    • Adhithi R. Raghavan,
    • Andreas Hochwagen
    Meiotic cells introduce numerous programmed DNA double-strand breaks (DSBs) into their genome to stimulate crossover recombination. DSB numbers must be high enough to ensure each homologous chromosome pair receives the obligate crossover required for accurate meiotic chromosome segregation. However, every DSB also increases the risk of aberrant or incomplete DNA repair, and thus genome instability. To mitigate these risks, meiotic cells have evolved an intricate network of controls that modulates the timing, levels, and genomic location of meiotic DSBs. This Review summarizes our current understanding of these controls with a particular focus on the mechanisms that prevent meiotic DSB formation at the wrong time or place, thereby guarding the genome from potentially catastrophic meiotic errors.
  • Cell-free DNA from clinical testing as a resource of population genetic analysis

    • Huanhuan Zhu,
    • Yu Wang,
    • Linxuan Li,
    • Lin Wang,
    • Haiqiang Zhang,
    • Xin Jin
    As a noninvasive biomarker, cell-free DNA (cfDNA) has achieved remarkable success in clinical applications. Notably, cfDNA is essentially DNA, and conducting whole-genome sequencing (WGS) can yield a wealth of genetic information. These invaluable data should not be confined to one-time use; instead, they should be leveraged for more comprehensive population genetic analysis, including genetic variation spectrum, population structure and genetic selection, and genome-wide association studies (GWASs), among others. Such research findings can, in turn, facilitate clinical practice, enabling more advanced and accurate disease predictions. This review explores the advantages, challenges, and current research areas of cfDNA in population genetics. We hope that this review can serve as a new chapter in the repurposing of cfDNA sequence data generated from clinical testing in population genetics.
  • Developmental evolution in fast-forward: insect male genital diversification

    • Maria D.S. Nunes,
    • Alistair P. McGregor
    Open Access
    Insect male genitalia are among the fastest evolving structures of animals. Studying these changes among closely related species represents a powerful approach to dissect the developmental processes and genetic mechanisms underlying phenotypic diversification, as well as its evolutionary drivers. Here, we review recent breakthroughs in understanding the developmental and genetic bases of the evolution of genital organs among Drosophila species and other insects. This work has helped reveal how tissue and organ size evolve, how morphological novelties appear, and how these phenotypic changes are generated through altered gene expression and redeployment of gene regulatory networks. Future studies of genital evolution in Drosophila and a wider range of insects hold great promise to help understand the specification, differentiation, and diversification of organs more generally.

Correction
