idc_index package API

Copyright (c) 2024 Imaging Data Commons. All rights reserved.

idc-index: Package to query and download data from an index of ImagingDataCommons

Submodules

idc_index.index module

class idc_index.index.IDCClient

Bases: object

CITATION_FORMAT_APA = 'text/x-bibliography; style=apa; locale=en-US'
CITATION_FORMAT_BIBTEX = 'application/x-bibtex'
CITATION_FORMAT_JSON = 'application/vnd.citationstyles.csl+json'
CITATION_FORMAT_TURTLE = 'text/turtle'
DOWNLOAD_HIERARCHY_DEFAULT = '%collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID'
citations_from_manifest(manifestFile, citation_format='text/x-bibliography; style=apa; locale=en-US')

Get the list of publications that should be cited/attributed for a cohort defined by a manifest.

Parameters:
  • manifestFile (str) – string containing the path to the manifest file.

  • citation_format (str) – string containing the format of the citation. Must be one of the following: CITATION_FORMAT_APA, CITATION_FORMAT_BIBTEX, CITATION_FORMAT_JSON. Defaults to CITATION_FORMAT_APA. Can be set to any of the alternative formats allowed by the DOI API, see https://citation.crosscite.org/docs.html#sec-4.

Returns:

List of citations in the requested format.

citations_from_selection(collection_id=None, patientId=None, studyInstanceUID=None, seriesInstanceUID=None, citation_format='text/x-bibliography; style=apa; locale=en-US')

Get the list of publications that should be cited/attributed for the specific collection, patient (case) ID, study or series UID.

Parameters:
  • collection_id – string or list of strings containing the values of collection_id to filter by

  • patientId – string or list of strings containing the values of PatientID to filter by

  • studyInstanceUID (str) – string or list of strings containing the values of DICOM StudyInstanceUID to filter by

  • seriesInstanceUID – string or list of strings containing the values of DICOM SeriesInstanceUID to filter by

  • citation_format – string containing the format of the citation. Must be one of the following: CITATION_FORMAT_APA, CITATION_FORMAT_BIBTEX, CITATION_FORMAT_JSON. Defaults to CITATION_FORMAT_APA. Can be set to any of the alternative formats allowed by the DOI API, see https://citation.crosscite.org/docs.html#sec-4.

Returns:

List of citations in the requested format.
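A minimal sketch of requesting citations in a non-default format. The helper name citations_for_collection is hypothetical, and the sketch assumes that the returned citations are strings (as they would be for BibTeX); it only relies on the citations_from_selection signature documented above.

```python
# Value copied from IDCClient.CITATION_FORMAT_BIBTEX as listed above.
CITATION_FORMAT_BIBTEX = "application/x-bibtex"


def citations_for_collection(client, collection_id,
                             citation_format=CITATION_FORMAT_BIBTEX):
    """Hypothetical helper: citations for one or more collections, without repeats.

    `client` is expected to behave like IDCClient, i.e. to expose
    citations_from_selection(collection_id=..., citation_format=...).
    Assumes each citation is a string (hashable).
    """
    citations = client.citations_from_selection(
        collection_id=collection_id,
        citation_format=citation_format,
    )
    # dict.fromkeys keeps the first occurrence of each entry, in order.
    return list(dict.fromkeys(citations))


# Intended use (needs idc-index installed and network access, so not run here):
#   from idc_index.index import IDCClient
#   for entry in citations_for_collection(IDCClient.client(), "nlst"):
#       print(entry)
```

Passing the client in as an argument keeps the helper easy to exercise with a stand-in object before pointing it at the real index.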

classmethod client()
Return type:

IDCClient

download_collection(collection_id, downloadDir, dry_run=False, quiet=True, show_progress_bar=True, use_s5cmd_sync=False, dirTemplate='%collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID', source_bucket_location='aws')

Download the files corresponding to the collection_id to the specified directory.

Parameters:
  • collection_id – string or list of strings containing the values of collection_id to filter by

  • downloadDir – string containing the path to the directory to download the files to

  • dry_run – calculates the size of the cohort but download does not start

  • quiet (bool) – If True, suppresses the output of the subprocess. Defaults to True.

  • show_progress_bar (bool) – If True, tracks the progress of download

  • use_s5cmd_sync (bool) – If True, will use s5cmd sync operation instead of cp when downloadDirectory is not empty; this can significantly improve the download speed if the content is partially downloaded

  • dirTemplate (str) – Download directory hierarchy template. This variable defines the folder hierarchy for organizing the downloaded files in downloadDir. Defaults to index.DOWNLOAD_HIERARCHY_DEFAULT, which is set to %collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID. The template string can be built using a combination of selected metadata attributes (PatientID, collection_id, Modality, StudyInstanceUID, SeriesInstanceUID) that must be prefixed by ‘%’. The following special characters can be used as separators: ‘-’ (hyphen), ‘/’ (slash for subdirectories), ‘_’ (underscore). When set to None, all files will be downloaded to the download directory with no subdirectories.

  • source_bucket_location – string selecting the provider of the bucket from which the files will be downloaded, allowing to select between Google (‘gcs’) and AWS (‘aws’) storage. Defaults to ‘aws’.

Return type:

None

Returns: None

Raises:

TypeError – If collection_id(s) passed is(are) not a string or list
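To make the dirTemplate rules concrete, the following is a small sketch of how such a template could be expanded into a relative path. It is not the library's implementation: the function expand_dir_template, the sample metadata values, and the choice to raise ValueError on unknown placeholders are all assumptions made for illustration.

```python
import re

# Attributes the documentation lists as usable in dirTemplate.
ALLOWED_ATTRIBUTES = (
    "PatientID",
    "collection_id",
    "Modality",
    "StudyInstanceUID",
    "SeriesInstanceUID",
)

# Match only the known attribute names, so that the "_" separator in
# "%Modality_%SeriesInstanceUID" is not swallowed into the attribute name.
_PLACEHOLDER = re.compile("%(" + "|".join(ALLOWED_ATTRIBUTES) + ")")


def expand_dir_template(template, metadata):
    """Hypothetical helper: turn a dirTemplate into a relative directory path."""
    if template is None:
        # Documented behaviour: no subdirectories at all.
        return ""
    # Any '%' left after removing known placeholders is an unsupported one.
    if "%" in _PLACEHOLDER.sub("", template):
        raise ValueError(f"Unsupported placeholder in template: {template!r}")
    return _PLACEHOLDER.sub(lambda m: str(metadata[m.group(1)]), template)


# Made-up metadata for one series.
series = {
    "collection_id": "nlst",
    "PatientID": "P1",
    "StudyInstanceUID": "1.2",
    "Modality": "CT",
    "SeriesInstanceUID": "1.2.3",
}
default_template = "%collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID"
print(expand_dir_template(default_template, series))  # nlst/P1/1.2/CT_1.2.3
```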

download_dicom_instance(sopInstanceUID, downloadDir, dry_run=False, quiet=True, show_progress_bar=True, use_s5cmd_sync=False, dirTemplate='%collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID', source_bucket_location='aws')

Download the files corresponding to the sopInstanceUID to the specified directory.

Parameters:
  • sopInstanceUID – string or list of strings containing the values of DICOM SOPInstanceUID to filter by

  • downloadDir – string containing the path to the directory to download the files to

  • dry_run – calculates the size of the cohort but download does not start

  • quiet (bool) – If True, suppresses the output of the subprocess. Defaults to True.

  • show_progress_bar (bool) – If True, tracks the progress of download

  • use_s5cmd_sync (bool) – If True, will use s5cmd sync operation instead of cp when downloadDirectory is not empty; this can significantly improve the download speed if the content is partially downloaded

  • dirTemplate (str) – Download directory hierarchy template. This variable defines the folder hierarchy for organizing the downloaded files in downloadDir. Defaults to index.DOWNLOAD_HIERARCHY_DEFAULT, which is set to %collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID. The template string can be built using a combination of selected metadata attributes (PatientID, collection_id, Modality, StudyInstanceUID, SeriesInstanceUID) that must be prefixed by ‘%’. The following special characters can be used as separators: ‘-’ (hyphen), ‘/’ (slash for subdirectories), ‘_’ (underscore). When set to None, all files will be downloaded to the download directory with no subdirectories.

  • source_bucket_location – string selecting the provider of the bucket from which the files will be downloaded, allowing to select between Google (‘gcs’) and AWS (‘aws’) storage. Defaults to ‘aws’.

Return type:

None

Returns: None

Raises:

TypeError – If sopInstanceUID(s) passed is(are) not a string or list

download_dicom_patients(patientId, downloadDir, dry_run=False, quiet=True, show_progress_bar=True, use_s5cmd_sync=False, dirTemplate='%collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID', source_bucket_location='aws')

Download the files corresponding to the patientId to the specified directory.

Parameters:
  • patientId – string or list of strings containing the values of DICOM patientId to filter by

  • downloadDir – string containing the path to the directory to download the files to

  • dry_run – calculates the size of the cohort but download does not start

  • quiet (bool) – If True, suppresses the output of the subprocess. Defaults to True.

  • show_progress_bar (bool) – If True, tracks the progress of download

  • use_s5cmd_sync (bool) – If True, will use s5cmd sync operation instead of cp when downloadDirectory is not empty; this can significantly improve the download speed if the content is partially downloaded

  • dirTemplate (str) – Download directory hierarchy template. This variable defines the folder hierarchy for organizing the downloaded files in downloadDir. Defaults to index.DOWNLOAD_HIERARCHY_DEFAULT, which is set to %collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID. The template string can be built using a combination of selected metadata attributes (PatientID, collection_id, Modality, StudyInstanceUID, SeriesInstanceUID) that must be prefixed by ‘%’. The following special characters can be used as separators: ‘-’ (hyphen), ‘/’ (slash for subdirectories), ‘_’ (underscore). When set to None, all files will be downloaded to the download directory with no subdirectories.

  • source_bucket_location – string selecting the provider of the bucket from which the files will be downloaded, allowing to select between Google (‘gcs’) and AWS (‘aws’) storage. Defaults to ‘aws’.

Return type:

None

Returns: None

Raises:

TypeError – If patientId(s) passed is(are) not a string or list

download_dicom_series(seriesInstanceUID, downloadDir, dry_run=False, quiet=True, show_progress_bar=True, use_s5cmd_sync=False, dirTemplate='%collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID', source_bucket_location='aws')

Download the files corresponding to the seriesInstanceUID to the specified directory.

Parameters:
  • seriesInstanceUID – string or list of strings containing the values of DICOM SeriesInstanceUID to filter by

  • downloadDir – string containing the path to the directory to download the files to

  • dry_run – calculates the size of the cohort but download does not start

  • quiet (bool) – If True, suppresses the output of the subprocess. Defaults to True.

  • show_progress_bar (bool) – If True, tracks the progress of download

  • use_s5cmd_sync (bool) – If True, will use s5cmd sync operation instead of cp when downloadDirectory is not empty; this can significantly improve the download speed if the content is partially downloaded

  • dirTemplate (str) – Download directory hierarchy template. This variable defines the folder hierarchy for organizing the downloaded files in downloadDir. Defaults to index.DOWNLOAD_HIERARCHY_DEFAULT, which is set to %collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID. The template string can be built using a combination of selected metadata attributes (PatientID, collection_id, Modality, StudyInstanceUID, SeriesInstanceUID) that must be prefixed by ‘%’. The following special characters can be used as separators: ‘-’ (hyphen), ‘/’ (slash for subdirectories), ‘_’ (underscore). When set to None, all files will be downloaded to the download directory with no subdirectories.

  • source_bucket_location – string selecting the provider of the bucket from which the files will be downloaded, allowing to select between Google (‘gcs’) and AWS (‘aws’) storage. Defaults to ‘aws’.

Return type:

None

Returns: None

Raises:

TypeError – If seriesInstanceUID(s) passed is(are) not a string or list

download_dicom_studies(studyInstanceUID, downloadDir, dry_run=False, quiet=True, show_progress_bar=True, use_s5cmd_sync=False, dirTemplate='%collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID', source_bucket_location='aws')

Download the files corresponding to the studyInstanceUID to the specified directory.

Parameters:
  • studyInstanceUID – string or list of strings containing the values of DICOM studyInstanceUID to filter by

  • downloadDir – string containing the path to the directory to download the files to

  • dry_run – calculates the size of the cohort but download does not start

  • quiet (bool) – If True, suppresses the output of the subprocess. Defaults to True.

  • show_progress_bar (bool) – If True, tracks the progress of download

  • use_s5cmd_sync (bool) – If True, will use s5cmd sync operation instead of cp when downloadDirectory is not empty; this can significantly improve the download speed if the content is partially downloaded

  • dirTemplate (str) – Download directory hierarchy template. This variable defines the folder hierarchy for organizing the downloaded files in downloadDir. Defaults to index.DOWNLOAD_HIERARCHY_DEFAULT, which is set to %collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID. The template string can be built using a combination of selected metadata attributes (PatientID, collection_id, Modality, StudyInstanceUID, SeriesInstanceUID) that must be prefixed by ‘%’. The following special characters can be used as separators: ‘-’ (hyphen), ‘/’ (slash for subdirectories), ‘_’ (underscore). When set to None, all files will be downloaded to the download directory with no subdirectories.

  • source_bucket_location – string selecting the provider of the bucket from which the files will be downloaded, allowing to select between Google (‘gcs’) and AWS (‘aws’) storage. Defaults to ‘aws’.

Return type:

None

Returns: None

Raises:

TypeError – If studyInstanceUID(s) passed is(are) not a string or list

download_from_manifest(manifestFile, downloadDir, quiet=True, validate_manifest=True, show_progress_bar=True, use_s5cmd_sync=False, dirTemplate='%collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID', progress_callback=None)

Download the files listed in the manifest file. The manifest file is first validated to ensure that every line contains a valid URL. The total size to be downloaded is then computed, and the download runs in one process while the download progress is tracked in another.

Parameters:
  • manifestFile (str) – The path to the manifest file.

  • downloadDir (str) – The directory to download the files to.

  • quiet (bool) – If True, suppresses the output of the subprocess. Defaults to True.

  • validate_manifest (bool) – If True, validates the manifest for any errors. Defaults to True.

  • show_progress_bar (bool) – If True, tracks the progress of download

  • use_s5cmd_sync (bool) – If True, will use s5cmd sync operation instead of cp when downloadDirectory is not empty; this can significantly improve the download speed if the content is partially downloaded

  • dirTemplate (str) – Download directory hierarchy template. This variable defines the folder hierarchy for organizing the downloaded files in downloadDir. Defaults to index.DOWNLOAD_HIERARCHY_DEFAULT, which is set to %collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID. The template string can be built using a combination of selected metadata attributes (PatientID, collection_id, Modality, StudyInstanceUID, SeriesInstanceUID) that must be prefixed by ‘%’. The following special characters can be used as separators: ‘-’ (hyphen), ‘/’ (slash for subdirectories), ‘_’ (underscore). When set to None, all files will be downloaded to the download directory with no subdirectories.

  • progress_callback (Callable[[float, float, str, str], None]) – Optional; A callback function that takes four parameters: downloaded_bytes (float), total_bytes (float), unit (str), description (str). This function will be called periodically if show_progress_bar is True to report download progress. If None, a default progress bar will be displayed using tqdm.

Return type:

None
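The progress_callback parameter can be illustrated with a small sketch. Only the four-argument signature comes from the description above; the factory make_progress_logger, the message format, and the sample values passed in are assumptions, and the actual unit and description strings supplied by the library may differ.

```python
def make_progress_logger(sink):
    """Hypothetical factory: build a callback with the documented signature.

    Each call appends one human-readable progress line to `sink` (a list).
    """
    def on_progress(downloaded_bytes, total_bytes, unit, description):
        # Guard against a zero total so the callback never divides by zero.
        percent = 100.0 * downloaded_bytes / total_bytes if total_bytes else 0.0
        sink.append(
            f"{description}: {downloaded_bytes:.0f}/{total_bytes:.0f} {unit} "
            f"({percent:.1f}%)"
        )

    return on_progress


messages = []
callback = make_progress_logger(messages)

# Simulated invocation with made-up numbers, to show what the callback records.
callback(50.0, 200.0, "B", "Downloading")
print(messages[0])  # Downloading: 50/200 B (25.0%)

# Intended use (needs idc-index and a real manifest; the path is a placeholder):
#   client.download_from_manifest(
#       manifestFile="manifest.s5cmd",
#       downloadDir="./data",
#       progress_callback=callback,
#   )
```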

download_from_selection(downloadDir, dry_run=False, collection_id=None, patientId=None, studyInstanceUID=None, seriesInstanceUID=None, sopInstanceUID=None, crdc_series_uuid=None, quiet=True, show_progress_bar=True, use_s5cmd_sync=False, dirTemplate='%collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID', source_bucket_location='aws')

Download the files corresponding to the selection. The filters are applied in sequence: first the collection(s), followed by patient(s), study (studies), and series. If no filter is specified, all the files will be downloaded.

Parameters:
  • downloadDir – string containing the path to the directory to download the files to

  • dry_run – calculates the size of the cohort but download does not start

  • collection_id – string or list of strings containing the values of collection_id to filter by

  • patientId – string or list of strings containing the values of PatientID to filter by

  • studyInstanceUID – string or list of strings containing the values of DICOM StudyInstanceUID to filter by

  • seriesInstanceUID – string or list of strings containing the values of DICOM SeriesInstanceUID to filter by

  • sopInstanceUID – string or list of strings containing the values of DICOM SOPInstanceUID to filter by

  • crdc_series_uuid – string or list of strings containing the values of crdc_series_uuid to filter by

  • quiet (bool) – If True, suppresses the output of the subprocess. Defaults to True

  • show_progress_bar (bool) – If True, tracks the progress of download

  • use_s5cmd_sync (bool) – If True, will use s5cmd sync operation instead of cp when downloadDirectory is not empty; this can significantly improve the download speed if the content is partially downloaded

  • dirTemplate (str) – Download directory hierarchy template. This variable defines the folder hierarchy for organizing the downloaded files in downloadDir. Defaults to index.DOWNLOAD_HIERARCHY_DEFAULT, which is set to %collection_id/%PatientID/%StudyInstanceUID/%Modality_%SeriesInstanceUID. The template string can be built using a combination of selected metadata attributes (PatientID, collection_id, Modality, StudyInstanceUID, SeriesInstanceUID) that must be prefixed by ‘%’. The following special characters can be used as separators: ‘-’ (hyphen), ‘/’ (slash for subdirectories), ‘_’ (underscore). When set to None, all files will be downloaded to the download directory with no subdirectories.

  • source_bucket_location – string selecting the provider of the bucket from which the files will be downloaded, allowing to select between Google (‘gcs’) and AWS (‘aws’) storage. Defaults to ‘aws’.
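On the order in which the selection filters are applied: if each filter only narrows the current selection (an assumption about the behaviour, not something confirmed by the source), then the order should not change the outcome. The sketch below demonstrates this on a few made-up in-memory rows; apply_filters is a hypothetical stand-in, not the library's code.

```python
from itertools import permutations

# Made-up index rows for illustration only.
records = [
    {"collection_id": "nlst", "PatientID": "P1", "StudyInstanceUID": "1.2.1"},
    {"collection_id": "nlst", "PatientID": "P2", "StudyInstanceUID": "1.2.2"},
    {"collection_id": "nlst", "PatientID": "P3", "StudyInstanceUID": "1.2.3"},
    {"collection_id": "other", "PatientID": "P1", "StudyInstanceUID": "1.2.4"},
]


def apply_filters(rows, filters):
    """Apply (column, allowed_values) filters one after another."""
    for column, allowed in filters:
        # Accept a single string or a list of strings, as the API does.
        allowed = {allowed} if isinstance(allowed, str) else set(allowed)
        rows = [row for row in rows if row[column] in allowed]
    return rows


filters = [
    ("collection_id", "nlst"),
    ("PatientID", ["P1", "P2"]),
    ("StudyInstanceUID", "1.2.1"),
]

# Try every possible ordering of the three filters.
results = [apply_filters(records, order) for order in permutations(filters)]
print(all(result == results[0] for result in results))  # True
```

Every ordering selects the same single row, because the combined filter is a logical AND of the individual conditions.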

fetch_index(index_name)

Downloads the requested index and adds it, joined with the main index, as the corresponding class attribute.

Parameters:

index_name (str) – Name of the index to be downloaded.

Return type:

None

get_clinical_table(table_name)

Returns the requested clinical table as a pandas DataFrame.

Parameters:

table_name (str) – Name of the clinical table to be loaded.

Returns:

The requested clinical table.

Return type:

pandas.DataFrame

get_collections()

Returns the collections present in IDC.

get_dicom_series(studyInstanceUID, outputFormat='dict')

Returns Series for a given study or list of studies.

Parameters:
  • studyInstanceUID (str or list of str) – The DICOM StudyInstanceUID or a list of StudyInstanceUIDs.

  • outputFormat (str) – The format in which to return the series. Available options are ‘dict’, ‘df’, and ‘list’. Default is ‘dict’.

Returns:

Series in the requested output format. By default, it returns a dictionary.

Return type:

dict or pandas.DataFrame or list

Raises:
  • ValueError – If outputFormat is not one of ‘dict’, ‘df’, ‘list’.

  • ValueError – If any of the studyInstanceUID does not exist.

get_dicom_studies(patientId, outputFormat='dict')

Returns Studies for a given patient or list of patients.

Parameters:
  • patientId (str or list of str) – The patient Id or a list of patient Ids.

  • outputFormat (str) – The format in which to return the studies. Available options are ‘dict’, ‘df’, and ‘list’. Default is ‘dict’.

Returns:

Studies in the requested output format. By default, it returns a dictionary.

Return type:

dict or pandas.DataFrame or list

Raises:
  • ValueError – If outputFormat is not one of ‘dict’, ‘df’, ‘list’.

  • ValueError – If any of the patientId does not exist.

static get_idc_version()

Returns the version of IDC data used in idc-index.

get_index_schema(index_name)

Get the full schema for an index, including column definitions.

This method returns the JSON schema for the specified index. The schema includes table_description and column definitions with name, type, mode, and description. Schemas are loaded from INDEX_METADATA during discovery.

Parameters:

index_name (str) – The name of the index to get the schema for.

Returns:

The schema dictionary containing ‘table_description’ and ‘columns’, or None if the schema is not available.

Return type:

dict or None

get_instance_file_URL(sopInstanceUID, source_bucket_location='aws')

Get the bucket URL of the file corresponding to a given SOPInstanceUID.

This function will only return the URL for the Slide Microscopy (SM) instances, which are maintained in the sm_instance_index table.

Parameters:
  • sopInstanceUID – string containing the value of DICOM SOPInstanceUID

  • source_bucket_location – string containing the source bucket location, either “aws” or “gcs”

Returns:

string containing the bucket URL of the file corresponding to the SOPInstanceUID, or None if the SOPInstanceUID is not recognized

get_patients(collection_id, outputFormat='dict')

Gets the patients in a collection.

Parameters:
  • collection_id (str or list[str]) – The collection id or list of collection ids, in lower case with words separated by underscores. For example, ‘pdmr_texture_analysis’ or [‘pdmr_texture_analysis’, ‘nlst’].

  • outputFormat (str) – The format in which to return the patient IDs. Available options are ‘dict’, ‘df’, and ‘list’. Default is ‘dict’.

Returns:

Patient IDs in the requested output format. By default, it returns a dictionary.

Return type:

dict or pandas.DataFrame or list

Raises:

ValueError – If outputFormat is not one of ‘dict’, ‘df’, ‘list’.

get_series_file_URLs(seriesInstanceUID, source_bucket_location='aws')

Get the URLs of the files corresponding to the DICOM instances in a given SeriesInstanceUID.

Parameters:
  • seriesInstanceUID – string containing the value of DICOM SeriesInstanceUID to filter by

  • source_bucket_location – string containing the source bucket location, either “aws” or “gcs”

Returns:

list of strings containing the AWS S3 URLs of the files corresponding to the SeriesInstanceUID

get_series_size(seriesInstanceUID)

Gets cumulative size (MB) of the DICOM instances in a given SeriesInstanceUID.

Parameters:

seriesInstanceUID (str) – The DICOM SeriesInstanceUID.

Returns:

The cumulative size of the DICOM instances in the given SeriesInstanceUID rounded to two digits, in MB.

Return type:

float

Raises:

ValueError – If the seriesInstanceUID does not exist.
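As a usage sketch, the documented return value (a float in MB) and the documented ValueError can be combined to estimate the size of several series before downloading. The helper total_series_size_mb is hypothetical; it only assumes that `client` exposes get_series_size as described above.

```python
def total_series_size_mb(client, series_uids):
    """Hypothetical helper: sum the sizes of several series.

    Returns (total_mb, missing), where `missing` lists the UIDs for which
    get_series_size raised ValueError (documented for unknown series).
    """
    total = 0.0
    missing = []
    for uid in series_uids:
        try:
            total += client.get_series_size(uid)
        except ValueError:
            missing.append(uid)
    # Round once at the end to avoid accumulating rounding noise.
    return round(total, 2), missing


# Intended use (needs idc-index installed; the UIDs are placeholders):
#   from idc_index.index import IDCClient
#   size_mb, unknown = total_series_size_mb(IDCClient.client(), ["1.2.3", "1.2.4"])
```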

get_viewer_URL(seriesInstanceUID=None, studyInstanceUID=None, viewer_selector=None)

Get the URL of the IDC viewer for the given series or study in IDC based on the provided SeriesInstanceUID or StudyInstanceUID. If StudyInstanceUID is not provided, it will be automatically deduced. If viewer_selector is not provided, default viewers will be used (OHIF v3 for radiology modalities, and Slim for SM).

This function will validate the provided SeriesInstanceUID or StudyInstanceUID against IDC index to ensure that the series or study is available in IDC.

Parameters:
  • seriesInstanceUID – string containing the value of DICOM SeriesInstanceUID for a series available in IDC

  • studyInstanceUID – string containing the value of DICOM StudyInstanceUID for a study available in IDC

  • viewer_selector – string containing the name of the viewer to use. Must be one of the following: ohif_v2, ohif_v3, or slim. If not provided, default viewers will be used: slim for studies that contain SM modality and ohif_v3 for radiology.

Returns:

string containing the IDC viewer URL for the requested selection
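The default viewer rule described above (slim for studies containing SM, ohif_v3 otherwise) can be written as a short sketch. The function pick_default_viewer is hypothetical and does not build a URL; raising ValueError for an unrecognized selector is an assumption, since the source does not say how invalid values are handled.

```python
VALID_VIEWERS = ("ohif_v2", "ohif_v3", "slim")


def pick_default_viewer(modalities, viewer_selector=None):
    """Hypothetical helper mirroring the documented viewer selection rule.

    `modalities` is an iterable of DICOM Modality codes present in the study.
    """
    if viewer_selector is not None:
        if viewer_selector not in VALID_VIEWERS:
            # Assumption: reject unknown viewer names explicitly.
            raise ValueError(f"viewer_selector must be one of {VALID_VIEWERS}")
        return viewer_selector
    # Documented default: Slim for slide microscopy, OHIF v3 for radiology.
    return "slim" if "SM" in set(modalities) else "ohif_v3"


print(pick_default_viewer(["CT", "SEG"]))  # ohif_v3
print(pick_default_viewer(["SM"]))         # slim
```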

refresh_indices_overview()

Refresh the list of available indices from INDEX_METADATA.

This method re-populates the indices_overview dictionary from INDEX_METADATA. Kept for API compatibility, but no longer performs network calls.

Returns:

The refreshed indices_overview dictionary.

Return type:

dict

sql_query(sql_query)

Execute SQL query against the table in the index using duckdb.

Parameters:

sql_query – string containing the SQL query to execute. Table names available: ‘index’, ‘prior_versions_index’, and any installed index (e.g., ‘sm_index’, ‘clinical_index’).

Returns:

pandas dataframe containing the results of the query

Raises:

duckdb.Error – any exception that duckdb.query() raises
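A minimal sketch of a query against the ‘index’ table. The helper series_count_by_collection is hypothetical, and the column names used (collection_id, Modality, SeriesInstanceUID) are assumed to exist in the index because they are the metadata attributes named elsewhere in this reference.

```python
def series_count_by_collection(client, modality):
    """Hypothetical helper: count series per collection for one modality.

    `client` is expected to expose sql_query(sql) as documented above.
    """
    # The value is interpolated into the SQL text, so restrict it to plain
    # alphanumeric modality codes such as "CT" or "MR".
    if not modality.isalnum():
        raise ValueError("modality must be an alphanumeric DICOM code")
    query = f"""
        SELECT collection_id,
               COUNT(DISTINCT SeriesInstanceUID) AS series_count
        FROM index
        WHERE Modality = '{modality}'
        GROUP BY collection_id
        ORDER BY series_count DESC
    """
    return client.sql_query(query)


# Intended use (needs idc-index installed, so it is not run here):
#   from idc_index.index import IDCClient
#   df = series_count_by_collection(IDCClient.client(), "MR")
#   print(df.head())
```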

exception idc_index.index.IDCClientInsufficientDiskSpaceError(disk_space_needed, disk_space_available, message=None)

Bases: Exception

Exception raised when there is insufficient disk space for download.

Parameters:
  • disk_space_needed (str)

  • disk_space_available (str)

  • message (str)
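The kind of check that leads to such an exception can be sketched with the standard library. Everything below is illustrative: InsufficientDiskSpaceError is a local stand-in that copies the documented constructor signature, and ensure_disk_space is a hypothetical helper, not the check the library performs.

```python
import shutil


class InsufficientDiskSpaceError(Exception):
    """Local stand-in with the same constructor arguments as documented."""

    def __init__(self, disk_space_needed, disk_space_available, message=None):
        self.disk_space_needed = disk_space_needed
        self.disk_space_available = disk_space_available
        self.message = message or (
            f"Need {disk_space_needed}, only {disk_space_available} available"
        )
        super().__init__(self.message)


def ensure_disk_space(download_dir, bytes_needed):
    """Hypothetical helper: raise if `download_dir` lacks `bytes_needed` free bytes."""
    free = shutil.disk_usage(download_dir).free
    if bytes_needed > free:
        # The documented parameters are strings, so format the numbers first.
        raise InsufficientDiskSpaceError(f"{bytes_needed} B", f"{free} B")
    return free


try:
    ensure_disk_space(".", 10**30)  # deliberately unrealistic requirement
except InsufficientDiskSpaceError as err:
    print(err.message)
```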

idc_index.cli module

CLI module for the IDC client.

This module provides command-line interface (CLI) commands to interact with the Imaging Data Commons (IDC) data.

idc_index.cli.set_log_level(log_level)

Set the logging level for the CLI module.

Parameters:

log_level (str) – The logging level to set.
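Setting a log level from a string is commonly done as in the sketch below, which uses only the standard logging module. The function set_level_by_name and the logger name are hypothetical; the source does not state which level names set_log_level accepts, so the standard level names are assumed.

```python
import logging


def set_level_by_name(logger_name, log_level):
    """Hypothetical helper: set a logger's level from a name such as "debug"."""
    # Standard level names (DEBUG, INFO, WARNING, ...) are integer attributes
    # of the logging module; anything else is rejected.
    level = getattr(logging, log_level.upper(), None)
    if not isinstance(level, int):
        raise ValueError(f"Unknown log level: {log_level!r}")
    logger = logging.getLogger(logger_name)
    logger.setLevel(level)
    return logger


logger = set_level_by_name("idc_index_sketch", "debug")
print(logger.level == logging.DEBUG)  # True
```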