If you do bioinformatics research with GEO (Gene Expression Omnibus) data, you may have used the GEOmetadb package. It was once a great tool, but it has not been updated for years. GEOmetadb is also static: it relies on the developer to update the database manually, so the data is often incomplete or outdated.

Why Build Your Own GEOmetadb?

  • Focus on Your Research Area: For example, diabetes, liver cancer, urothelial carcinoma, etc. You can aggregate and annotate all relevant Series, making it easier for subsequent bulk downloads, filtering, and reproducible analysis.
  • Flexible Offline Querying: GEO’s web search is not flexible enough. After building your own database with geokit, you can use familiar tools like regular expressions or database queries (e.g., grep, SQLite) to freely search and meet various needs.
  • Dynamic Updates and Incremental Sync: Automate scripts to regularly update the database, ensuring that the data is always up-to-date and avoiding the hassle of manual updates.
  • Efficient Data Access: After storing the data offline, query speeds are significantly improved, especially when handling large GEO Series datasets, saving both time and bandwidth.
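
The incremental-sync idea can be sketched in a few lines of base R. This is only an illustration under stated assumptions: the cache is a single RDS file, and `sync_metadata()` and `fetch_fun` are hypothetical names that are not part of geokit. In practice `fetch_fun` would wrap whatever download step you use.

```r
# A minimal sketch of incremental sync, not geokit's own mechanism.
# `sync_metadata()`, `cache_file`, and `fetch_fun` are hypothetical names.
sync_metadata <- function(accessions, cache_file, fetch_fun) {
  # Load the existing cache, or start from an empty table
  cached <- if (file.exists(cache_file)) {
    readRDS(cache_file)
  } else {
    data.frame(accession = character(), title = character())
  }

  # Only accessions missing from the cache need to be fetched
  new_ids <- setdiff(accessions, cached$accession)
  if (length(new_ids) > 0L) {
    cached <- rbind(cached, fetch_fun(new_ids))
    saveRDS(cached, cache_file)
  }
  cached
}

# Toy fetcher standing in for a real download step
fake_fetch <- function(ids) {
  data.frame(accession = ids, title = paste("Title of", ids))
}

cache_file <- tempfile(fileext = ".rds")
first <- sync_metadata(c("GSE1", "GSE2"), cache_file, fake_fetch)
second <- sync_metadata(c("GSE2", "GSE3"), cache_file, fake_fetch)
nrow(second) # GSE1, GSE2, and GSE3 are now cached
```

On the second call only `GSE3` is fetched, because `GSE2` is already in the cache.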

geokit offers an efficient and straightforward way to do this. Its core operations are implemented in Rust, so you can build your own metadata database in just a few minutes. For example, processing 654 records takes around 30.5 seconds when the data is already downloaded. The workflow has three parts:

  • Search the GEO database using NCBI Eutils to find the relevant records.
  • Fetch the metadata for those records from GEO and save it locally to build your own GEOmetadb.
  • Use R’s regular expressions, filtering features, or tools like Excel and SQLite to quickly search and analyze Series (GSE) and sample information.

Search and Filter Single-Cell Studies of Urothelial Carcinoma

1. Search GEO with NCBI Eutils

The search uses NCBI Eutils query syntax. For more details, see "Querying GEO DataSets" on the NCBI GEO website.

library(geokit)

# Run one search per disease term; each returns a table of GEO Series records
uc_gse <- list(
    geo_search(
        "bladder cancer[ALL] AND Homo sapiens[ORGN] AND GSE[ETYP]"
    ),
    geo_search(
        "urothelial cancer[ALL] AND Homo sapiens[ORGN] AND GSE[ETYP]"
    )
)
#> → Parsing GEO records
#> Get records from NCBI for 626 queries in 2.3s
#> → Parsing GEO records
# Combine both result sets and drop Series found by both searches
uc_gse <- unique(dplyr::bind_rows(uc_gse))
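
When several terms share the same organism and entry-type filters, the query strings can be generated instead of typed by hand. The sketch below uses a hypothetical helper, `build_geo_query()`, which is not a geokit function; it only reproduces the field tags used in the search above.

```r
# A minimal sketch: compose Entrez query strings from a vector of terms.
# `build_geo_query()` is a hypothetical helper, not part of geokit.
build_geo_query <- function(term, organism = "Homo sapiens", entry_type = "GSE") {
  sprintf("%s[ALL] AND %s[ORGN] AND %s[ETYP]", term, organism, entry_type)
}

queries <- build_geo_query(c("bladder cancer", "urothelial cancer"))
queries
```

Each element of `queries` can then be passed to `geo_search()` in the same way as the hand-written strings, for example with `lapply(queries, geo_search)`.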

2. Extract Sample Count from the “Contains” Field

# Capture the number that precedes "Sample"/"Samples" and store it as an integer
uc_gse$number_of_samples <- as.integer(
  stringr::str_match(uc_gse$Contains, "(\\d+) Samples?")[, 2L, drop = TRUE]
)

# Quick Statistics
max_samples <- max(uc_gse$number_of_samples, na.rm = TRUE)
median_samples <- median(uc_gse$number_of_samples, na.rm = TRUE)
# Series with the largest sample count (ties are kept)
top_series <- dplyr::slice_max(uc_gse, number_of_samples)
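
To see what the regular expression extracts, it helps to run it on a few toy strings. The values below only imitate the "Contains" field and are assumptions; the real text returned by GEO may be formatted differently.

```r
# Toy strings imitating the "Contains" field; the real text may differ.
contains <- c("1 Sample", "24 Samples", "no count available")

# Column 1 of the match matrix is the full match,
# column 2 is the first capture group (the digits)
n_samples <- as.integer(
  stringr::str_match(contains, "(\\d+) Samples?")[, 2L]
)
n_samples
```

Strings without a sample count yield `NA`, which is why the statistics above use `na.rm = TRUE`.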

3. Fetch Series Metadata (Parallel Processing and Batch Saving to Local Directory)

uc_gse_meta <- geo_meta(uc_gse[["Series Accession"]],
  odir = "gse_urothelial_cancer"
)
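
The general idea of batch processing can be illustrated by splitting a vector of accessions into fixed-size groups. This is a sketch of the concept only, with toy accession IDs and an arbitrary batch size; it makes no claim about how `geo_meta()` schedules or saves its work internally.

```r
# A minimal sketch of splitting accessions into fixed-size batches.
# This illustrates the general idea only, not geokit's internals.
accessions <- sprintf("GSE%d", 1:10) # toy accession IDs
batch_size <- 4L

# Element i goes to batch ceiling(i / batch_size)
batches <- split(accessions, ceiling(seq_along(accessions) / batch_size))
lengths(batches)
```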

4. Filter Possible Single-Cell Studies (Search for Keywords in Summary/Title/Design)

# Keep Series whose summary, title, or design mentions single-cell keywords
uc_gse_sc <- dplyr::filter(
  uc_gse_meta,
  dplyr::if_any(
    c(Series_summary, Series_title, Series_overall_design),
    ~ grepl("single[- ]cell|scRNA", .x, ignore.case = TRUE)
  )
) |>
  # Count samples by splitting the "; "-separated sample IDs
  dplyr::mutate(
    number_of_samples = lengths(
      strsplit(Series_sample_id, "; ", fixed = TRUE)
    )
  )
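
The two building blocks of this step, the keyword pattern and the sample count, can be checked on toy input before running them on real metadata. The titles and sample IDs below are made up for illustration.

```r
# Toy titles (made up) to check which ones the keyword pattern catches
titles <- c(
  "Single-cell RNA-seq of bladder tumors",
  "scRNA-seq atlas of urothelial carcinoma",
  "single cell profiling of tumor stroma",
  "Bulk RNA-seq of bladder cancer cell lines"
)
is_sc <- grepl("single[- ]cell|scRNA", titles, ignore.case = TRUE)
is_sc

# Toy sample-ID strings in the "; "-separated layout assumed above
sample_ids <- c("GSM1; GSM2; GSM3", "GSM4")
n_per_series <- lengths(strsplit(sample_ids, "; ", fixed = TRUE))
n_per_series
```

The pattern is deliberately permissive, so the result is a list of possible single-cell studies that still needs a manual check.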

# Output Results and Statistics
dplyr::slice_max(uc_gse_sc, number_of_samples)$Series_geo_accession
max(uc_gse_sc$number_of_samples, na.rm = TRUE)
median(uc_gse_sc$number_of_samples, na.rm = TRUE)
writexl::write_xlsx(uc_gse_sc, "uc_gse_sc.xlsx")
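
Besides Excel, the filtered table can be stored in SQLite and queried with SQL. The sketch below assumes the DBI and RSQLite packages are installed and uses a toy table with made-up accessions in place of `uc_gse_sc`; the table name `series` is also an arbitrary choice.

```r
# A minimal sketch, assuming the DBI and RSQLite packages are installed.
# The toy table stands in for `uc_gse_sc`; accessions and titles are made up.
toy_meta <- data.frame(
  Series_geo_accession = c("GSE0001", "GSE0002", "GSE0003"),
  Series_title = c("Single-cell atlas", "Bulk profiling", "scRNA-seq time course"),
  number_of_samples = c(12L, 40L, 8L)
)

# An in-memory database keeps the example self-contained
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "series", toy_meta)

# Series with at least 10 samples, largest first
hits <- DBI::dbGetQuery(
  con,
  "SELECT Series_geo_accession, number_of_samples
   FROM series
   WHERE number_of_samples >= 10
   ORDER BY number_of_samples DESC"
)
DBI::dbDisconnect(con)
hits
```

Replacing `":memory:"` with a file path would keep the database on disk between sessions.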

By following these steps, you have successfully created your own GEOmetadb!

Session Information

sessionInfo()
#> R version 4.5.2 (2025-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so;  LAPACK version 3.12.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] geokit_0.0.1.9000
#> 
#> loaded via a namespace (and not attached):
#>  [1] vctrs_0.7.0       httr_1.4.7        cli_3.6.5         knitr_1.51       
#>  [5] rlang_1.1.7       xfun_0.56         stringi_1.8.7     generics_0.1.4   
#>  [9] textshaping_1.0.4 jsonlite_2.0.0    glue_1.8.0        htmltools_0.5.9  
#> [13] XML_3.99-0.20     ragg_1.5.0        sass_0.4.10       rmarkdown_2.30   
#> [17] tibble_3.3.1      evaluate_1.0.5    jquerylib_0.1.4   fastmap_1.2.0    
#> [21] yaml_2.3.12       lifecycle_1.0.5   stringr_1.6.0     compiler_4.5.2   
#> [25] rentrez_1.2.4     dplyr_1.1.4       fs_1.6.6          pkgconfig_2.0.3  
#> [29] systemfonts_1.3.1 digest_0.6.39     R6_2.6.1          tidyselect_1.2.1 
#> [33] pillar_1.11.1     curl_7.0.0        magrittr_2.0.4    bslib_0.9.0      
#> [37] tools_4.5.2       pkgdown_2.2.0     cachem_1.1.0      desc_1.4.3