# IDC Index Architecture Guide

This document provides a developer-focused overview of the `idc-index` codebase
architecture.

## Project Structure

```text
idc-index/
├── idc_index/
│   ├── __init__.py          # Package exports (IDCClient, exceptions)
│   ├── index.py             # Main IDCClient class implementation
│   └── cli.py               # Command-line interface
├── tests/
│   ├── idcindex.py          # Main test suite for IDCClient
│   └── test_package.py      # Package-level tests
└── docs/
    └── ...
```

## Core Class: IDCClient

The main class is `IDCClient` in `idc_index/index.py`. It provides access to
DICOM metadata indices and download functionality.

### Initialization (`__init__`)

The constructor (lines ~88-158) performs:

1. **Load bundled indices** from `idc-index-data` package:
   - `self.index` - Main DICOM metadata index (always loaded)
   - `self.prior_versions_index` - Historical version data (always loaded)

2. **Initialize lazy-loaded indices** as `None`:
   - `self.sm_index` - Slide microscopy series metadata
   - `self.sm_instance_index` - Slide microscopy instance metadata
   - `self.clinical_index` - Clinical data index

3. **Discover available indices** via `_discover_available_indices()`:
   - Populates `self.indices_overview` dictionary
   - Uses `INDEX_METADATA` from `idc-index-data` package

4. **Setup utilities**:
   - `self.s5cmdPath` - Path to s5cmd executable for downloads
   - `self._duckdb_conn` - Reusable DuckDB connection for SQL queries

### Index Management

#### `indices_overview` Dictionary

Central registry of all available indices. Structure:

```text
{
    "index": {
        "description": "Main IDC index...",
        "installed": True,           # Always True for bundled
        "url": None,                 # None for bundled indices
        "file_path": "/path/to/index.parquet",
        "schema": {...}              # JSON schema with column definitions
    },
    "sm_index": {
        "description": "Slide microscopy...",
        "installed": True/False,     # True if downloaded
        "url": "https://github.com/.../sm_index.parquet",
        "file_path": "/path/or/None",
        "schema": {...}
    },
    ...
}
```

#### Key Methods

| Method                          | Location | Purpose                                            |
| ------------------------------- | -------- | -------------------------------------------------- |
| `_discover_available_indices()` | ~156-221 | Populates `indices_overview` from `INDEX_METADATA` |
| `refresh_indices_overview()`    | ~223-233 | Refreshes the indices list                         |
| `fetch_index(index_name)`       | ~437-520 | Downloads/loads an index and sets class attribute  |
| `get_index_schema(index_name)`  | ~235-254 | Returns JSON schema for an index                   |

#### Index Loading Flow

```text
User calls fetch_index("sm_index")
    │
    ├─► Index already installed & loaded? → Return
    │
    ├─► Index installed but not loaded?
    │       └─► Load from disk: pd.read_parquet(filepath)
    │       └─► setattr(self, "sm_index", dataframe)
    │
    └─► Index not installed?
            └─► Download from URL
            └─► Save to disk
            └─► Load into memory
            └─► Update indices_overview["sm_index"]["installed"] = True
```

### SQL Query System

#### `sql_query()` Method (~2307-2329)

Executes SQL queries against loaded indices using DuckDB.

**How it works:**

1. Iterates over `indices_overview`
2. For each index, checks if DataFrame is loaded (`getattr(self, name, None)`)
3. Registers loaded DataFrames with DuckDB connection
4. Executes query and returns pandas DataFrame

**DuckDB Integration:**

- Uses `duckdb.connect()` for in-process database
- `conn.register(name, df)` creates virtual table from DataFrame (zero-copy)
- Re-registration is safe and updates the reference

### Data Download Methods

The class provides multiple download methods, all following similar patterns:

| Method                      | Purpose                                 |
| --------------------------- | --------------------------------------- |
| `download_dicom_series()`   | Download series by SeriesInstanceUID    |
| `download_dicom_studies()`  | Download studies by StudyInstanceUID    |
| `download_dicom_patients()` | Download all data for patient(s)        |
| `download_collection()`     | Download entire collection              |
| `download_from_manifest()`  | Download from s5cmd manifest file       |
| `download_from_selection()` | Download from IDC Portal selection file |

**Common patterns:**

- Use `s5cmd` for parallel downloads from AWS/GCP
- Support directory templates for organizing downloads
- Query indices using internal DuckDB SQL

### Internal Query Pattern

Many methods use DuckDB queries with local DataFrames:

```python
def some_method(self):
    # Create local variable from class attribute
    index = self.index

    # DuckDB finds 'index' in local scope automatically
    sql = """
        SELECT column1, column2
        FROM index
        WHERE condition = 'value'
    """
    result = duckdb.query(sql).df()
```

This "replacement scan" pattern works because DuckDB inspects the Python call
stack to find DataFrames by variable name.

## Testing

### Test Structure

Tests are in `tests/idcindex.py` using `unittest.TestCase`.

**Test class setup:**

```python
class TestIDCClient(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        cls.client = IDCClient()  # Shared client instance
```

**Key test patterns:**

- Use `self.subTest()` for parameterized tests with clear failure attribution
- Use `tempfile.TemporaryDirectory()` for download tests
- Mock or skip network-dependent tests appropriately

### Running Tests

```bash
# All tests
uv run pytest tests/ -v

# Specific test
uv run pytest tests/idcindex.py::TestIDCClient::test_sql_queries -v

# Tests matching pattern
uv run pytest tests/idcindex.py -k "sql_query" -v
```

## Dependencies

### External Packages

| Package          | Purpose                                        |
| ---------------- | ---------------------------------------------- |
| `idc-index-data` | Provides INDEX_METADATA, bundled parquet files |
| `pandas`         | DataFrame operations                           |
| `duckdb`         | SQL query engine                               |
| `requests`       | HTTP downloads for non-bundled indices         |
| `s5cmd`          | Parallel cloud storage downloads               |
| `platformdirs`   | Cross-platform data directories                |

### Data Locations

```python
# Versioned index data (downloaded indices)
self.indices_data_dir = platformdirs.user_data_dir(
    "idc_index_data", "IDC", version=version("idc-index-data")
)

# IDC data (clinical files, etc.)
self.idc_data_dir = platformdirs.user_data_dir("IDC", "IDC", version=self.idc_version)
```

## Common Development Tasks

### Adding a New Index

1. Add index metadata to `idc-index-data` package's `INDEX_METADATA`
2. The index will automatically appear in `indices_overview`
3. Users can fetch it with `fetch_index("new_index_name")`
4. It will automatically be available in `sql_query()` after fetching

### Modifying SQL Query Behavior

The `sql_query()` method is at ~line 2307. Key considerations:

- All indices in `indices_overview` are automatically registered if loaded
- Uses `self._duckdb_conn` (created in `__init__`) for connection reuse
- `conn.register()` is zero-copy for pandas DataFrames

### Adding New Download Methods

Follow existing patterns:

1. Build SQL query to select relevant rows from `self.index`
2. Generate s5cmd manifest file
3. Execute download using `subprocess.run([self.s5cmdPath, ...])`