IDC Index Architecture Guide¶

This document provides a developer-focused overview of the idc-index codebase architecture.

Project Structure¶

idc-index/
├── idc_index/
│   ├── __init__.py          # Package exports (IDCClient, exceptions)
│   ├── index.py             # Main IDCClient class implementation
│   └── cli.py               # Command-line interface
├── tests/
│   ├── idcindex.py          # Main test suite for IDCClient
│   └── test_package.py      # Package-level tests
└── docs/
    └── ...

Core Class: IDCClient¶

The main class is IDCClient in idc_index/index.py. It provides access to DICOM metadata indices and download functionality.

Initialization (`init`)¶

The constructor (lines ~88-158) performs:

Load bundled indices from idc-index-data package:
- self.index - Main DICOM metadata index (always loaded)
- self.prior_versions_index - Historical version data (always loaded)
Initialize lazy-loaded indices as None:
- self.sm_index - Slide microscopy series metadata
- self.sm_instance_index - Slide microscopy instance metadata
- self.clinical_index - Clinical data index
Discover available indices via _discover_available_indices():
- Populates self.indices_overview dictionary
- Uses INDEX_METADATA from idc-index-data package
Setup utilities:
- self.s5cmdPath - Path to s5cmd executable for downloads
- self._duckdb_conn - Reusable DuckDB connection for SQL queries

Index Management¶

`indices_overview` Dictionary¶

Central registry of all available indices. Structure:

{
    "index": {
        "description": "Main IDC index...",
        "installed": True,           # Always True for bundled
        "url": None,                 # None for bundled indices
        "file_path": "/path/to/index.parquet",
        "schema": {...}              # JSON schema with column definitions
    },
    "sm_index": {
        "description": "Slide microscopy...",
        "installed": True/False,     # True if downloaded
        "url": "https://github.com/.../sm_index.parquet",
        "file_path": "/path/or/None",
        "schema": {...}
    },
    ...
}

Key Methods¶

Method	Location	Purpose
`_discover_available_indices()`	~156-221	Populates `indices_overview` from `INDEX_METADATA`
`refresh_indices_overview()`	~223-233	Refreshes the indices list
`fetch_index(index_name)`	~437-520	Downloads/loads an index and sets class attribute
`get_index_schema(index_name)`	~235-254	Returns JSON schema for an index

Index Loading Flow¶

User calls fetch_index("sm_index")
    │
    ├─► Index already installed & loaded? → Return
    │
    ├─► Index installed but not loaded?
    │       └─► Load from disk: pd.read_parquet(filepath)
    │       └─► setattr(self, "sm_index", dataframe)
    │
    └─► Index not installed?
            └─► Download from URL
            └─► Save to disk
            └─► Load into memory
            └─► Update indices_overview["sm_index"]["installed"] = True

SQL Query System¶

`sql_query()` Method (~2307-2329)¶

Executes SQL queries against loaded indices using DuckDB.

How it works:

Iterates over indices_overview
For each index, checks if DataFrame is loaded (getattr(self, name, None))
Registers loaded DataFrames with DuckDB connection
Executes query and returns pandas DataFrame

DuckDB Integration:

Uses duckdb.connect() for in-process database
conn.register(name, df) creates virtual table from DataFrame (zero-copy)
Re-registration is safe and updates the reference

Data Download Methods¶

The class provides multiple download methods, all following similar patterns:

Method	Purpose
`download_dicom_series()`	Download series by SeriesInstanceUID
`download_dicom_studies()`	Download studies by StudyInstanceUID
`download_dicom_patients()`	Download all data for patient(s)
`download_collection()`	Download entire collection
`download_from_manifest()`	Download from s5cmd manifest file
`download_from_selection()`	Download from IDC Portal selection file

Common patterns:

Use s5cmd for parallel downloads from AWS/GCP
Support directory templates for organizing downloads
Query indices using internal DuckDB SQL

Internal Query Pattern¶

Many methods use DuckDB queries with local DataFrames:

def some_method(self):
    # Create local variable from class attribute
    index = self.index

    # DuckDB finds 'index' in local scope automatically
    sql = """
        SELECT column1, column2
        FROM index
        WHERE condition = 'value'
    """
    result = duckdb.query(sql).df()

This “replacement scan” pattern works because DuckDB inspects the Python call stack to find DataFrames by variable name.

Testing¶

Test Structure¶

Tests are in tests/idcindex.py using unittest.TestCase.

Test class setup:

class TestIDCClient(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        cls.client = IDCClient()  # Shared client instance

Key test patterns:

Use self.subTest() for parameterized tests with clear failure attribution
Use tempfile.TemporaryDirectory() for download tests
Mock or skip network-dependent tests appropriately

Running Tests¶

# All tests
uv run pytest tests/ -v

# Specific test
uv run pytest tests/idcindex.py::TestIDCClient::test_sql_queries -v

# Tests matching pattern
uv run pytest tests/idcindex.py -k "sql_query" -v

Dependencies¶

External Packages¶

Package	Purpose
`idc-index-data`	Provides INDEX_METADATA, bundled parquet files
`pandas`	DataFrame operations
`duckdb`	SQL query engine
`requests`	HTTP downloads for non-bundled indices
`s5cmd`	Parallel cloud storage downloads
`platformdirs`	Cross-platform data directories

Data Locations¶

# Versioned index data (downloaded indices)
self.indices_data_dir = platformdirs.user_data_dir(
    "idc_index_data", "IDC", version=version("idc-index-data")
)

# IDC data (clinical files, etc.)
self.idc_data_dir = platformdirs.user_data_dir("IDC", "IDC", version=self.idc_version)

Common Development Tasks¶

Adding a New Index¶

Add index metadata to idc-index-data package’s INDEX_METADATA
The index will automatically appear in indices_overview
Users can fetch it with fetch_index("new_index_name")
It will automatically be available in sql_query() after fetching

Modifying SQL Query Behavior¶

The sql_query() method is at ~line 2307. Key considerations:

All indices in indices_overview are automatically registered if loaded
Uses self._duckdb_conn (created in __init__) for connection reuse
conn.register() is zero-copy for pandas DataFrames

Adding New Download Methods¶

Follow existing patterns:

Build SQL query to select relevant rows from self.index
Generate s5cmd manifest file
Execute download using subprocess.run([self.s5cmdPath, ...])