rna_code.data.interface package

Submodules

rna_code.data.interface.BRCA_interface module

Interface with the BRCA dataset.

class rna_code.data.interface.BRCA_interface.BRCAInterface(data_path: Path = PosixPath('/home/runner/work/biosequence_encoding/biosequence_encoding/rna_code/../data/BRCA'), metadata_path: Path = PosixPath('/home/runner/work/biosequence_encoding/biosequence_encoding/rna_code/../data/BRCA/metadata.cart.2023-09-22.json'))

Bases: BaseInterface

Interface with the app and file system for the BRCA dataset.

Parameters:
  • data_path (Path, optional) – Path of the directory containing the data, by default BRCA_DATA_PATH

  • metadata_path (Path, optional) – Path of the metadata file, by default BRCA_METADATA_FILE

property entry_names: list[str]

Get entry names.

Returns:

List containing the name for each observation

Return type:

list[str]

find_subtypes()

Find subtypes associated with each observation based on subtype file.

load_patients()

Load patients based on precomputed entries.

setup()

Perform all steps necessary to provide a dataset.

rna_code.data.interface.base_interface module

Base class for interfacing app with file system.

class rna_code.data.interface.base_interface.BaseInterface(data_path: Path, metadata_path: Path)

Bases: ABC

Base class for interfacing the app with the file system.

Parameters:
  • data_path (Path) – Data path

  • metadata_path (Path) – Metadata file path
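The relationship between an abstract base and a dataset-specific subclass can be sketched as follows. This is a minimal illustration, not the package's code: `SketchInterface`, `ToyInterface`, the choice of which members are abstract, and the hard-coded entry names are all assumptions made for the example.

```python
from abc import ABC, abstractmethod
from pathlib import Path


class SketchInterface(ABC):
    """Hypothetical stand-in for a base interface (not the real BaseInterface)."""

    def __init__(self, data_path: Path, metadata_path: Path) -> None:
        # Store both locations so subclasses can read from them later.
        self.data_path = Path(data_path)
        self.metadata_path = Path(metadata_path)

    @property
    @abstractmethod
    def entry_names(self) -> list[str]:
        """Name of each observation (assumed to be abstract here)."""

    @abstractmethod
    def setup(self) -> None:
        """Prepare the dataset (assumed to be abstract here)."""


class ToyInterface(SketchInterface):
    """Hypothetical subclass for an imaginary dataset."""

    def __init__(self, data_path: Path, metadata_path: Path) -> None:
        super().__init__(data_path, metadata_path)
        self._entries: list[str] = []

    @property
    def entry_names(self) -> list[str]:
        return list(self._entries)

    def setup(self) -> None:
        # A real interface would read files here; the sketch uses fixed names.
        self._entries = ["sample_1", "sample_2"]


toy = ToyInterface(Path("data/TOY"), Path("data/TOY/metadata.json"))
toy.setup()
print(toy.entry_names)
```

Because the base derives from `ABC`, instantiating it directly raises `TypeError`; only subclasses that implement every abstract member can be created.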

static get_gene_names_from_file(filename: str, header: int = 0, skiprows: List[int] | None = None) → DataFrame

Read gene names from a specified file.

Parameters:
  • filename (str) – Path to the file from which to read the names.

  • header (int, optional) – Row number to use as the header (column names). Defaults to 0.

  • skiprows (list of int, optional) – Rows to skip at the start of the file.

Returns:

A DataFrame containing the names from the file.

Return type:

pd.DataFrame

Raises:
  • FileNotFoundError – If the specified file does not exist.

  • ParserError – If there is an error in parsing the file.
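How `header` and `skiprows` interact can be illustrated with plain pandas. The sketch below is an assumption about the kind of call involved, not the method's actual implementation: `read_gene_names`, the tab separator, and the sample content are hypothetical.

```python
import io

import pandas as pd

# Hypothetical tab-separated content: one comment line, then a header row.
raw = (
    "# gene-level quantification\n"
    "gene_id\tgene_name\n"
    "ENSG00000000003.15\tTSPAN6\n"
    "ENSG00000000005.6\tTNMD\n"
)


def read_gene_names(source, header: int = 0, skiprows=None) -> pd.DataFrame:
    """Illustrative stand-in; assumes a tab-separated file."""
    return pd.read_csv(source, sep="\t", header=header, skiprows=skiprows)


# skiprows drops line 0 first; header=0 then refers to the first remaining line.
names = read_gene_names(io.StringIO(raw), header=0, skiprows=[0])
print(names)

# A missing path surfaces as FileNotFoundError, matching the Raises list above.
try:
    read_gene_names("no_such_file.tsv")
    missing = False
except FileNotFoundError:
    missing = True
```

Malformed rows, such as a line with more fields than the header, would surface from pandas as `pandas.errors.ParserError`.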

static load_patient_data(filename: str, header: int = 0) → Series

Load patient data from a specified file.

Parameters:
  • filename (str) – Path to the data file.

  • header (int, optional) – Row number to use as the header. Defaults to 0.

Returns:

A pandas Series containing TPM values from the file.

Return type:

pd.Series

Raises:

FileNotFoundError – If the specified file does not exist.
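Extracting a Series of TPM values from a tabular file might look like the sketch below. It is written under stated assumptions rather than taken from the package: `read_tpm`, the tab separator, the `tpm_unstranded` column name, and the sample rows are all hypothetical.

```python
import io

import pandas as pd

# Hypothetical per-patient quantification table (tab-separated).
raw = (
    "gene_id\ttpm_unstranded\n"
    "ENSG_A\t12.5\n"
    "ENSG_B\t0.0\n"
)


def read_tpm(source, header: int = 0, column: str = "tpm_unstranded") -> pd.Series:
    """Illustrative stand-in; the column name is an assumption."""
    # Use the first column (gene identifiers) as the index of the result.
    table = pd.read_csv(source, sep="\t", header=header, index_col=0)
    return table[column]


tpm = read_tpm(io.StringIO(raw))
print(tpm)
```

Indexing the Series by gene identifier keeps values from different patients aligned when they are later combined into one table.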

static retrieve_position(names, drop_na=False)

Retrieve genomic positions for a list of gene names.

Parameters:
  • names (pd.DataFrame) – DataFrame containing gene names.

  • drop_na (bool, optional) – Flag to drop NA values. Defaults to False.

  • verbose (int, optional) – Verbosity level. This parameter is not part of the signature shown above.

Returns:

DataFrame with retrieved genomic positions and symbols.

Return type:

pd.DataFrame

rna_code.data.interface.cptac_3_interface module

Interface with the CPTAC-3 dataset.

class rna_code.data.interface.cptac_3_interface.CPTAC3Interface(data_path: Path = PosixPath('/home/runner/work/biosequence_encoding/biosequence_encoding/rna_code/../data/CPTAC-3'), metadata_path: Path = PosixPath('/home/runner/work/biosequence_encoding/biosequence_encoding/rna_code/../data/CPTAC-3/metadata.repository.2024-11-07.json'))

Bases: BaseInterface

Interface with the app and file system for the CPTAC-3 dataset.

Parameters:
  • data_path (Path, optional) – Path of the directory containing the data, by default CPTAC_3_DATA_PATH

  • metadata_path (Path, optional) – Path of the metadata file, by default CPTAC_3_METADATA_FILE

property entry_names: list[str]

Get entry names.

Returns:

List containing the name for each observation

Return type:

list[str]

find_subtypes()

Find subtypes associated with each observation based on subtype file.

load_patients()

Load patients based on precomputed entries.

setup()

Perform all steps necessary to provide a dataset.

Module contents