The GEO database provides SOFT (Simple Omnibus Format in Text) files for GPL, GSM, and GDS entities. SOFT is designed for efficient batch submission and download of data. It is a simple, line-based, plain text format, which makes it easy to generate SOFT files from common spreadsheet and database applications.
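To make the line-based layout concrete, here is a minimal base R sketch that parses a tiny hand-written SOFT fragment. It relies only on the standard SOFT line prefixes (`^` for the entity line, `!` for attributes, `#` for column descriptions, and the `!sample_table_begin` / `!sample_table_end` markers). The fragment's values and the `split_kv` helper are made up for illustration; this is not how geokit parses files internally.

```r
# A tiny hand-written SOFT fragment (illustrative values, not real GEO data).
soft_lines <- c(
  "^SAMPLE = ToySample1",
  "!Sample_title = toy sample one",
  "!Sample_organism = Homo sapiens",
  "#TAG = ten-base tag",
  "#COUNT = raw tag count",
  "!sample_table_begin",
  "TAG\tCOUNT",
  "AAAAAAAAAA\t17",
  "AAAAAAATCA\t1",
  "!sample_table_end"
)

# Hypothetical helper: drop the one-character prefix, then split "key = value".
split_kv <- function(x) {
  x <- substring(x, 2L)
  key <- trimws(sub("=.*$", "", x))
  value <- trimws(sub("^[^=]*=", "", x))
  stats::setNames(value, key)
}

# Locate the data table; everything before it is header material.
begin <- grep("^!sample_table_begin", soft_lines, ignore.case = TRUE)
end <- grep("^!sample_table_end", soft_lines, ignore.case = TRUE)
header <- soft_lines[seq_len(begin - 1L)]

entity <- split_kv(header[startsWith(header, "^")])   # record type and name
metadata <- split_kv(header[startsWith(header, "!")]) # attribute lines
columns <- split_kv(header[startsWith(header, "#")])  # column descriptions

# The rows between the two markers form a tab-delimited table.
table_lines <- soft_lines[(begin + 1L):(end - 1L)]
datatable <- read.delim(text = table_lines, header = TRUE,
                        stringsAsFactors = FALSE)

entity
metadata
columns
datatable
```

Real SOFT files are much larger and may repeat attribute keys, so a production parser needs more care than this sketch shows.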
The geo_soft function downloads and preprocesses SOFT files. The
following examples download SOFT files for the GSM and GDS
entities:
gsm <- geo_soft("GSM1", odir = tempdir())
#> Downloading 1 file
gsm
#> <GEOSoft>
#> datatable: a 5494 * 3 data.frame
#> datatable vars: TAG COUNT TPM
#> columns: a 3 * 1 data.frame
#> columns vars: labelDescription
#> metadata: Sample_title Sample_geo_accession Sample_status ...
#> Sample_series_id Sample_data_row_count (29 total)
#> accession: GSM1
gds <- geo_soft("GDS10", odir = tempdir())
#> Downloading 1 file
gds
#> <GEOSoft>
#> datatable: a 39114 * 30 data.frame
#> datatable vars: ID_REF IDENTIFIER GSM582 ... GSM602 GSM609 (30 total)
#> columns: a 30 * 1 data.frame
#> columns vars: labelDescription
#> metadata: DATABASE DATASET SUBSET
#> accession: GDS10

A single SOFT file can contain both data tables and accompanying
descriptive information for multiple, concatenated Platforms, Samples,
and/or Series records. The geokit package provides a
GEOSoft class object to store SOFT file contents. The
GEOSoft object contains six slots: accession,
rcd_type, rcd_name, metadata,
datatable, and columns.
accession: Stores the GEO accession ID.
rcd_type: Indicates the type of record (e.g., Platform, Sample, Series, Datasets). This helps categorize the data and identify the nature of the record.
rcd_name: Represents the name associated with the record (e.g., the GEO dataset name). It usually matches the accession, but in some cases, it may differ. This allows for a more flexible identification of the record.
metadata: Contains the header metadata from the SOFT file.
datatable: Contains the main data table, which is the primary data for analysis.
columns: Provides descriptive column headers for the datatable.
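As a rough picture of how these six slots fit together, the sketch below defines a hypothetical S4 class named `ToySoft` with the same slot names. The class name, the slot types, and the toy values are all assumptions made for illustration; geokit's actual GEOSoft definition may differ in types, validation, and construction.

```r
library(methods)

# Hypothetical stand-in class; NOT geokit's real GEOSoft definition.
# The slot types below are assumptions chosen to match the descriptions.
setClass(
  "ToySoft",
  representation(
    accession = "character",
    rcd_type  = "character",
    rcd_name  = "character",
    metadata  = "list",
    datatable = "data.frame",
    columns   = "data.frame"
  )
)

# Build a toy record by hand (illustrative values, not real GEO data).
toy <- new(
  "ToySoft",
  accession = "TOY1",
  rcd_type  = "Sample",
  rcd_name  = "TOY1",
  metadata  = list(Sample_title = "toy sample"),
  datatable = data.frame(
    TAG = c("AAAAAAAAAA", "AAAAAAATCA"),
    COUNT = c(17L, 1L)
  ),
  columns = data.frame(
    labelDescription = c("ten-base tag", "raw tag count"),
    row.names = c("TAG", "COUNT")
  )
)

# List the slot names in definition order.
slotNames(toy)

# Each row of `columns` describes one column of `datatable`.
identical(rownames(toy@columns), names(toy@datatable))
```

The sketch reads slots with `@` only to stay self-contained; with real geokit objects, the accessor functions are the intended way to read them.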
You can use functions with the same names as the slots to extract the data.
head(datatable(gsm))
#> TAG COUNT TPM
#> 1 AAAAAAAAAA 17 1741.98
#> 2 AAAAAAATCA 1 102.47
#> 3 AAAAAAATTT 1 102.47
#> 4 AAAAAACAAA 1 102.47
#> 5 AAAAAACTCC 1 102.47
#> 6 AAAAAATAAA 1 102.47
head(columns(gsm))
#> labelDescription
#> TAG Ten base SAGE tag,
#> COUNT TAG NUMBER
#> TPM tags per million
head(datatable(gds))
#> ID_REF IDENTIFIER GSM582 GSM589 GSM583 GSM590 GSM584 GSM591 GSM585 GSM592
#> 1 1 1200011I18Rik 101 54 111 55 87 30 99 43
#> 2 2 2 26 23 30 27 19 22 32 19
#> 3 3 Ccdc28b NA NA NA NA NA NA NA NA
#> 4 4 AA014405 233 162 252 178 214 144 238 147
#> 5 5 Crebrf NA NA NA NA NA NA NA NA
#> 6 6 6 691 661 696 652 609 665 684 672
#> GSM586 GSM593 GSM587 GSM594 GSM588 GSM595 GSM596 GSM603 GSM597 GSM604 GSM598
#> 1 105 56 43 14 112 43 97 36 117 40 125
#> 2 24 25 14 49 32 29 31 22 26 26 35
#> 3 NA NA NA 7 NA 4 10 22 NA 15 NA
#> 4 250 166 86 22 236 139 216 112 241 130 270
#> 5 NA NA NA NA NA 3 NA NA NA NA NA
#> 6 644 679 631 596 609 606 601 557 596 580 601
#> GSM605 GSM599 GSM606 GSM600 GSM607 GSM601 GSM608 GSM602 GSM609
#> 1 45 99 1 109 38 87 18 72 16
#> 2 26 18 13 25 32 28 40 14 41
#> 3 23 NA 29 9 25 11 40 NA 22
#> 4 144 239 148 211 139 208 16 174 15
#> 5 NA NA NA NA NA NA NA NA NA
#> 6 554 562 561 580 568 519 562 497 564
head(columns(gds))
#> labelDescription
#> ID_REF Platform reference identifier
#> IDENTIFIER identifier
#> GSM582 Value for GSM582: NOD_S1; src: Spleen
#> GSM589 Value for GSM589: NOD_S2; src: Spleen
#> GSM583 Value for GSM583: Idd3_S1; src: Spleen
#> GSM590 Value for GSM590: Idd3_S2; src: Spleen

For the GPL entity, the structure differs from that of GSM and GDS.
The geokit package provides the GEOPlatform
class to store the contents of a GPL SOFT file. A GPL SOFT file
typically also includes most of the contents of the GSE and GSM
records associated with the platform. The GEOPlatform
class therefore contains both gse and gsm slots, each
being a list of GEOSoft objects.
gpl <- geo_soft("GPL98", odir = tempdir())
#> Downloading 1 file
gpl
#> <GEOSoft>
#> datatable: a 8934 * 16 data.frame
#> datatable vars: ID GB_ACC SPOT_ID ... Gene Ontology Cellular
#> Component Gene Ontology Molecular Function (16 total)
#> columns: a 16 * 1 data.frame
#> columns vars: labelDescription
#> metadata: Platform_title Platform_geo_accession Platform_status ...
#> Platform_series_id Platform_data_row_count (26 total)
#> accession: GPL98

Similarly, a GSE SOFT file includes the GPL and GSM records that
belong to the series. The GEOSeries class provides both gpl and
gsm slots, each being a list of GEOSoft objects.
gse <- geo_soft("GSE10", odir = tempdir())
#> Downloading 1 file
gse
#> <GEOSeries>
#> gsm: GSM571 GSM572 GSM573 GSM574
#> gpl: GPL4
#> datatable: a 0 * 0 data.frame
#> columns: a 0 * 1 data.frame
#> columns vars: labelDescription
#> metadata: DATABASE SERIES
#> accession: GSE10

Session Information
sessionInfo()
#> R version 4.5.2 (2025-10-31)
#> Platform: x86_64-pc-linux-gnu
#> Running under: Ubuntu 24.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
#>
#> locale:
#> [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
#> [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
#> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] geokit_0.0.1.9000
#>
#> loaded via a namespace (and not attached):
#> [1] digest_0.6.39 desc_1.4.3 R6_2.6.1 codetools_0.2-20
#> [5] fastmap_1.2.0 xfun_0.56 cachem_1.1.0 knitr_1.51
#> [9] htmltools_0.5.9 rmarkdown_2.30 lifecycle_1.0.5 cli_3.6.5
#> [13] sass_0.4.10 pkgdown_2.2.0 textshaping_1.0.4 jquerylib_0.1.4
#> [17] systemfonts_1.3.1 compiler_4.5.2 tools_4.5.2 ragg_1.5.0
#> [21] curl_7.0.0 evaluate_1.0.5 bslib_0.9.0 yaml_2.3.12
#> [25] jsonlite_2.0.0 rlang_1.1.7 fs_1.6.6