cas.utils package

Submodules

cas.utils.conversion_utils module

cas.utils.conversion_utils.calculate_labelset_rank(input_list)[source]

Assign ranks to items in a list.

Parameters:

input_list (List[str]) – The list of items.

Returns:

A dictionary where keys are items from the input list and values are their corresponding ranks (0-based).

Return type:

Dict[str, int]
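The behavior described above can be sketched as follows. This is a minimal illustration, assuming that an item's rank is simply its position in the list; `rank_items` is a hypothetical stand-in and not the library's implementation.

```python
from typing import Dict, List


def rank_items(input_list: List[str]) -> Dict[str, int]:
    """Map each item to its 0-based position (the assumed meaning of 'rank')."""
    return {item: rank for rank, item in enumerate(input_list)}


# Illustrative labelset names only.
ranks = rank_items(["cell_type", "subclass", "class"])
print(ranks)
```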

cas.utils.conversion_utils.calculate_labelset(obs, labelsets)[source]

Calculates labelset dictionary based on the provided observations.

Parameters:
  • obs (pd.DataFrame) – DataFrame containing observations.

  • labelsets (List[str]) – List of labelsets.

Returns:

A dictionary where keys are labelsets and values are dictionaries containing:
  • “members”: set of members for the labelset.

  • “rank”: rank of the labelset.

Return type:

Dict[str, Dict[str, Any]]
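The shape of the returned dictionary can be sketched without pandas. The example below assumes that the members of a labelset are the unique values of the matching obs column and that the rank is the labelset's position in the input list; both are assumptions, and `build_labelset_dict` and the plain dict of columns are hypothetical stand-ins for the real function and DataFrame.

```python
from typing import Any, Dict, List


def build_labelset_dict(
    obs_columns: Dict[str, List[str]], labelsets: List[str]
) -> Dict[str, Dict[str, Any]]:
    """Collect unique column values and a positional rank per labelset."""
    result: Dict[str, Dict[str, Any]] = {}
    for rank, name in enumerate(labelsets):
        # Assumption: members are the distinct values found in the column.
        result[name] = {"members": set(obs_columns[name]), "rank": rank}
    return result


# A dict of columns stands in for the obs DataFrame (made-up values).
obs_columns = {
    "subclass": ["A1", "A2", "B1"],
    "class": ["A", "A", "B"],
}
labelset_dict = build_labelset_dict(obs_columns, ["subclass", "class"])
```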

cas.utils.conversion_utils.add_labelsets_to_cas(cas, labelset_dict)[source]

Updates a CAS dictionary with labelsets derived from a labelset information dictionary.

Parameters:
  • cas (Dict[str, Any]) – The CAS dictionary to update.

  • labelset_dict (Dict[str, Dict[str, Any]]) – Contains labelset names as keys and dicts with ‘rank’ (and potentially other info) as values.

cas.utils.conversion_utils.get_cell_ids(anndata_obs, labelset, cell_label)[source]

Get cell IDs from an AnnData dataset based on a specified labelset and cell label.

Parameters:
  • anndata_obs (DataFrame) – The observations DataFrame (obs) of an AnnData object containing the dataset. This DataFrame should include columns for cell type ontology term IDs and cell types.

  • labelset (str) – Labelset to filter.

  • cell_label (str) – The value of the cell label used to filter rows in anndata_obs.

Returns:

List of cell IDs.

Return type:

List[str]
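A rough sketch of the filtering step, assuming that cell IDs are the row labels (index) of obs. A dict keyed by cell ID stands in for the DataFrame, and `select_cell_ids` is a hypothetical name.

```python
from typing import Dict, List


def select_cell_ids(
    obs_rows: Dict[str, Dict[str, str]], labelset: str, cell_label: str
) -> List[str]:
    """Return the IDs of rows whose `labelset` column equals `cell_label`."""
    return [
        cell_id
        for cell_id, row in obs_rows.items()
        if row.get(labelset) == cell_label
    ]


# Made-up rows: cell ID -> column values.
obs_rows = {
    "cell_1": {"class": "A"},
    "cell_2": {"class": "B"},
    "cell_3": {"class": "A"},
}
cell_ids = select_cell_ids(obs_rows, "class", "A")
```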

cas.utils.conversion_utils.get_cl_annotations_from_anndata(anndata_obs, columns_name, cell_label)[source]

Retrieves the cell ontology term ID and cell ontology term for a given cell label from the observation DataFrame of an AnnData object.

Parameters:
  • anndata_obs (DataFrame) – The observations DataFrame (obs) of an AnnData object containing the dataset. This DataFrame should include columns for cell type ontology term IDs and cell types.

  • columns_name (str) – The name of the column in anndata_obs used for filtering based on the cell label.

  • cell_label (str) – The value of the cell label used to filter rows in anndata_obs.

Returns:

A tuple containing two elements:
  • The first element is the cell type ontology term ID associated with the given cell label.

  • The second element is the cell type (ontology term) associated with the given cell label.

Return type:

tuple

cas.utils.conversion_utils.collect_parent_cell_ids(cas)[source]

Collects parent cell IDs from the given CAS data.

This function iterates through labelsets in the CAS data and collects parent cell IDs associated with each labelset annotation. It populates and returns a dictionary mapping parent cell set accessions to sets of corresponding cell IDs.

Parameters:

cas (Dict[str, Any]) – The Cell Annotation Schema data containing labelsets and annotations.

Return type:

Dict[str, Set]

Returns:

A dictionary mapping parent cell set accessions to sets of corresponding cell IDs.
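The grouping described above might look like the sketch below. It is a simplification that walks the annotations directly, and the field names `annotations`, `parent_cell_set_accession`, and `cell_ids` are assumptions about the CAS layout; `gather_parent_cell_ids` is a hypothetical name.

```python
from typing import Any, Dict, Set


def gather_parent_cell_ids(cas: Dict[str, Any]) -> Dict[str, Set[str]]:
    """Group cell IDs under the parent accession of each annotation."""
    parent_cell_ids: Dict[str, Set[str]] = {}
    for annotation in cas.get("annotations", []):
        parent = annotation.get("parent_cell_set_accession")  # assumed field name
        if parent is None:
            continue  # top-level annotations contribute nothing
        parent_cell_ids.setdefault(parent, set()).update(
            annotation.get("cell_ids", [])
        )
    return parent_cell_ids


# Made-up CAS fragment.
cas = {
    "annotations": [
        {"cell_label": "A1", "parent_cell_set_accession": "ACC_A", "cell_ids": ["c1"]},
        {"cell_label": "A2", "parent_cell_set_accession": "ACC_A", "cell_ids": ["c2", "c3"]},
        {"cell_label": "A", "cell_ids": ["c1", "c2", "c3"]},
    ]
}
grouped = gather_parent_cell_ids(cas)
```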

cas.utils.conversion_utils.generate_parent_cell_lookup(anndata, labelset_dict, accessions_mapping=None)[source]

Generates a lookup dictionary mapping cell labels to various metadata, including cell IDs, rank, and cell ontology terms. This function is designed to precompute the lookup information needed for CAS annotation generation, especially useful when hierarchy inclusion is desired.

Parameters:
  • anndata (ad.AnnData) – The AnnData object containing the single-cell dataset, including metadata in anndata.obs.

  • labelset_dict (Dict[str, Any]) – A dictionary where keys are labelset names and values are dictionaries containing members and their ranks.

  • accessions_mapping (Dict[str, str], optional) – Mapping of cellset names to accession IDs. To allow the same name to be used across different labelsets, each key is identified as labelset:cell_label.

Returns:

A dictionary where each key is a cell label and each value is another dictionary containing keys for ‘cell_ids’ (a set of cell IDs associated with the label), ‘rank’, ‘cell_ontology_term_id’, and ‘cell_ontology_term’.

Return type:

Dict[str, Any]

cas.utils.conversion_utils.update_parent_info(value, parent_key, parent_value)[source]

Updates parent information in a child item’s dictionary.

Parameters:
  • value (Dict[str, Any]) – The child item’s dictionary to be updated.

  • parent_key (str) – The key of the parent item.

  • parent_value (Dict[str, Any]) – The parent item’s dictionary.

This function modifies value to include parent (using parent_key), p_accession, and parent_rank based on parent_value.
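The in-place update can be sketched as below. The keys written to the child (`parent`, `p_accession`, `parent_rank`) come from the description above; the keys read from the parent (`accession`, `rank`) are assumptions, and `attach_parent` is a hypothetical name.

```python
from typing import Any, Dict


def attach_parent(
    value: Dict[str, Any], parent_key: str, parent_value: Dict[str, Any]
) -> None:
    """Record the parent's key, accession, and rank on the child dictionary."""
    value["parent"] = parent_key
    value["p_accession"] = parent_value.get("accession")  # assumed source key
    value["parent_rank"] = parent_value.get("rank")  # assumed source key


# Made-up child and parent entries.
child = {"rank": 0}
attach_parent(child, "class:A", {"accession": "ACC_A", "rank": 1})
```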

cas.utils.conversion_utils.add_parent_cell_hierarchy(parent_cell_look_up)[source]

Processes parent cell hierarchy information and updates CAS dictionary annotations accordingly.

Parameters:

parent_cell_look_up (Dict[str, Any]) – Dictionary containing parent cell information.

Returns:

None

cas.utils.conversion_utils.add_parent_hierarchy_to_annotations(cas, parent_cell_look_up)[source]

Adds parent hierarchy information to annotations in the CAS dictionary.

Parameters:
  • cas (Dict[str, Any]) – The CAS dictionary containing annotations.

  • parent_cell_look_up (Dict[str, Any]) – Dictionary containing parent cell information.

Returns:

None

cas.utils.conversion_utils.get_authors_from_doi(doi)[source]

Fetches and returns a list of authors from a given DOI (Digital Object Identifier) using the CrossRef API.

Parameters:

doi (str) – The DOI of the publication for which to retrieve author information.

Returns:

A list of dictionaries where each dictionary contains details of one author, including their name (‘author_name’), ORCID ID (‘orcid’), GitHub username (‘github_username’), and email (‘email’). Each field is a string, and fields without data will be None.

Return type:

list of dict

Raises:

KeyError – If the author data is not found in the response, indicating a potential issue with the DOI or the data format.
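The sketch below shows how a CrossRef-style response could be turned into the author dictionaries described above, without making a network call. It assumes the usual layout of a CrossRef works response (`message` → `author`, with `given`, `family`, and `ORCID` fields); the payload is invented placeholder data, and `authors_from_crossref_payload` is a hypothetical helper rather than the library's code.

```python
from typing import Any, Dict, List, Optional


def authors_from_crossref_payload(
    payload: Dict[str, Any]
) -> List[Dict[str, Optional[str]]]:
    """Build author records; raises KeyError when no author data is present."""
    authors = []
    for entry in payload["message"]["author"]:  # KeyError if the field is absent
        parts = [entry.get("given"), entry.get("family")]
        name = " ".join(part for part in parts if part)
        authors.append(
            {
                "author_name": name or None,
                "orcid": entry.get("ORCID"),
                # Assumption: these two are not taken from this payload.
                "github_username": None,
                "email": None,
            }
        )
    return authors


# Placeholder payload, not real publication data.
payload = {"message": {"author": [{"given": "Ada", "family": "Lovelace"}]}}
authors = authors_from_crossref_payload(payload)
```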

cas.utils.conversion_utils.reformat_json(input_json, input_key='annotations', exclude_key='cell_ids')[source]

Reformat the input JSON to create a new JSON structure, copying all fields and modifying the ‘input_key’ field. This function serializes the modified JSON to a string.

Parameters:
  • input_json (Dict[str, Any]) – The original JSON object as a Python dictionary.

  • input_key (str) – The key in the original JSON where annotations are stored.

  • exclude_key (str) – The key within annotations to exclude from the copied data.

Return type:

str

Returns:

A JSON string of the reformatted JSON object.
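One plausible reading of this function is sketched below, assuming that "modifying" the `input_key` field means dropping `exclude_key` from every entry in that list. `strip_key_from_items` is a hypothetical name, and the indentation setting is an arbitrary choice.

```python
import json
from typing import Any, Dict


def strip_key_from_items(
    input_json: Dict[str, Any],
    input_key: str = "annotations",
    exclude_key: str = "cell_ids",
) -> str:
    """Copy all fields, drop `exclude_key` from each item, and serialize."""
    output = dict(input_json)  # shallow copy keeps the original untouched
    output[input_key] = [
        {key: val for key, val in item.items() if key != exclude_key}
        for item in input_json.get(input_key, [])
    ]
    return json.dumps(output, indent=2)


# Made-up CAS fragment.
cas = {
    "title": "example",
    "annotations": [{"cell_label": "A", "cell_ids": ["c1", "c2"]}],
}
reformatted = strip_key_from_items(cas)
```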

cas.utils.conversion_utils.convert_complex_type(value)[source]

Converts all complex types to strings except for bool, int, float, and str.
  • Leaves bool types (including numpy.bool_) unchanged.

  • Converts everything else to strings.
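A minimal sketch of this conversion rule, using only built-in types. The `numpy.bool_` case mentioned above is left out to keep the example free of dependencies, and `to_simple` is a hypothetical name.

```python
from typing import Any


def to_simple(value: Any) -> Any:
    """Keep bool, int, float, and str as they are; stringify anything else."""
    if isinstance(value, (bool, int, float, str)):
        return value
    return str(value)


converted = [to_simple(v) for v in (True, 3, 2.5, "text", [1, 2], None)]
```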

cas.utils.conversion_utils.copy_and_update_file_path(anndata_file_path, output_file_path)[source]

Copies the AnnData file to a new location if an output file path is provided, and updates the file path.

Parameters:
  • anndata_file_path (str) – The path to the original AnnData file.

  • output_file_path (Optional[str]) – The path to which the file should be copied. If not provided, no copying occurs.

Returns:

The updated file path. If output_file_path is provided, it will return the new path, otherwise the original anndata_file_path.

Return type:

str
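The copy-or-keep behavior can be sketched with the standard library. `copy_if_requested` is a hypothetical name, and the file written here is a placeholder rather than a real AnnData file.

```python
import os
import shutil
import tempfile
from typing import Optional


def copy_if_requested(source_path: str, output_path: Optional[str] = None) -> str:
    """Copy the file when an output path is given and return the path to use."""
    if output_path:
        shutil.copy(source_path, output_path)
        return output_path
    return source_path


with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "input.h5ad")
    with open(source, "wb") as handle:
        handle.write(b"placeholder")  # stand-in content only
    target = os.path.join(tmp, "copy.h5ad")

    unchanged = copy_if_requested(source)  # no output path: original is returned
    copied = copy_if_requested(source, target)  # output path: file is copied
    copy_exists = os.path.exists(target)
```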

cas.utils.conversion_utils.fetch_anndata(input_json, download_dir=None)[source]

Fetches the AnnData file based on the provided CAS JSON input.

Parameters:
  • input_json (Dict[str, Any]) – A dictionary containing CAS JSON data. Must include a “matrix_file_id” key.

  • download_dir (Optional[str]) – The directory where the AnnData file should be downloaded. If not provided, the current working directory is used.

Returns:

The path to the downloaded AnnData file.

Return type:

str

Raises:

KeyError – If the “matrix_file_id” key is missing from the input_json.

cas.utils.conversion_utils.retrieve_schema(schema_name)[source]

cas.utils.conversion_utils.create_accession_mapping(adata_obs, labelsets, accession_columns)[source]

Creates a mapping of cellset names to accession IDs based on the provided labelsets and accession columns.

Parameters:
  • adata_obs (DataFrame) – The observations DataFrame (obs) of an AnnData object containing the dataset.

  • labelsets (list) – List of labelset names to be used for mapping.

  • accession_columns (list) – List of columns in the AnnData obs that contain accession information.

Returns:

Map of cellset names to accession IDs, where keys are formatted as “labelset:cell_label”.

Return type:

Optional[Dict[str, str]]
cas.utils.validation_utils module

cas.utils.validation_utils.validate_markers(cas, adata, marker_column)[source]

Validates whether the specified marker column in the anndata DataFrame contains all markers mentioned in the ‘annotations’ of the CAS dictionary. Raises an exception if the marker column does not exist.

Parameters:
  • cas (Dict[str, Any]) – A dictionary containing various configurations and annotations, including marker gene evidence.

  • adata (DataFrame) – An anndata DataFrame.

  • marker_column (str) – The name of the column in adata.var which contains the list of marker genes to be validated.

Return type:

bool

Returns:

True if all markers are validated without any issue; otherwise returns False or raises an exception.

Raises:

KeyError – If the specified marker_column is not found in adata.var.

Note

This function uses validate_labelset_markers to perform the actual validation for each annotation entry in cas.

cas.utils.validation_utils.validate_labelset_markers(annotation, marker_list)[source]

Validates if the markers from a specific annotation are present in the provided marker list. Logs a warning if any markers are missing.

Parameters:
  • annotation (Dict[str, Any]) – A single annotation entry from CAS.

  • marker_list (List[str]) – A list of marker genes to be checked against the markers in the annotation.

Returns:

This function does not return a value but will log a warning if validation fails.

Note

This function is intended to be used within validate_markers to handle individual annotation validation.
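The per-annotation check can be sketched as below. The field names `marker_gene_evidence` and `cell_label` are assumptions about the annotation layout, and `find_missing_markers` is a hypothetical helper. Unlike the documented function, which returns nothing, the sketch returns the missing markers so the result is easy to inspect.

```python
import logging
from typing import Any, Dict, List

logger = logging.getLogger("marker_validation_sketch")


def find_missing_markers(
    annotation: Dict[str, Any], marker_list: List[str]
) -> List[str]:
    """Log a warning for annotation markers that are absent from marker_list."""
    known = set(marker_list)
    # Assumed field name for the markers attached to an annotation.
    expected = annotation.get("marker_gene_evidence", [])
    missing = [marker for marker in expected if marker not in known]
    if missing:
        logger.warning(
            "Markers %s of annotation %r are not in the marker list",
            missing,
            annotation.get("cell_label"),
        )
    return missing


# Made-up annotation and marker names.
annotation = {"cell_label": "A", "marker_gene_evidence": ["GENE_A", "GENE_B"]}
missing = find_missing_markers(annotation, ["GENE_A", "GENE_C"])
```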

cas.utils.validation_utils.validate_labelset_hierarchy(cas, obs, validate=False)[source]

Validates the labelset hierarchy by performing multiple consistency checks between CAS and obs.

This function runs three validation checks:

 1. Ensures all labelsets from CAS exist in obs.

 2. Verifies that all labelset values from CAS annotations exist in the corresponding obs columns.

 3. Checks if the inferred parent-child hierarchy from obs matches CAS-defined ranks.

If any of these checks fail, warnings or errors will be logged. If validate=True, the process will terminate with sys.exit(1) if any validation check fails.

Parameters:
  • cas (Dict[str, Any]) – The CAS JSON object containing labelset definitions and annotations.

  • obs (Union[pd.DataFrame, CapAnnDataDF]) – The AnnData obs DataFrame or a CapAnnDataDF object.

  • validate (bool, optional) – If True, exits the program with an error code if any validation fails. Defaults to False.

Return type:

None

Returns:

None

cas.utils.validation_utils.compare_labelsets_cas_obs(cas, obs)[source]

Compare labelsets from CAS JSON object with the columns of the obs DataFrame.

Logs a warning if any labelsets from CAS are missing in obs.

Parameters:
  • cas (Dict[str, Any]) – The CAS JSON object.

  • obs (Union[DataFrame, CapAnnDataDF]) – The AnnData obs DataFrame.

Return type:

bool

Returns:

True if all labelsets from CAS exist as columns in obs, otherwise False.
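A sketch of this comparison, assuming that the CAS object lists its labelsets under a `labelsets` key with a `name` field per entry. A plain list of column names stands in for obs, and `labelsets_present` is a hypothetical name.

```python
import logging
from typing import Any, Dict, List

logger = logging.getLogger("labelset_check_sketch")


def labelsets_present(cas: Dict[str, Any], obs_columns: List[str]) -> bool:
    """Return True when every CAS labelset name is also an obs column."""
    cas_labelsets = {entry["name"] for entry in cas.get("labelsets", [])}
    missing = sorted(cas_labelsets - set(obs_columns))
    if missing:
        logger.warning("Labelsets missing from obs: %s", missing)
    return not missing


# Made-up CAS fragment.
cas = {"labelsets": [{"name": "class", "rank": 1}, {"name": "subclass", "rank": 0}]}
all_present = labelsets_present(cas, ["class", "subclass", "donor"])
some_missing = labelsets_present(cas, ["class"])
```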

cas.utils.validation_utils.validate_labelset_values(cas, obs)[source]

Validate that all labelset members from CAS annotations exist in the corresponding obs labelset columns.

Logs warnings for any missing labelset members.

Parameters:
  • cas (Dict[str, Any]) – The CAS JSON object.

  • obs (pd.DataFrame) – The AnnData obs DataFrame.

Return type:

bool

Returns:

True if all labelset members from CAS exist in obs, otherwise False.

cas.utils.validation_utils.check_parent_child_consistency(cas, obs)[source]

Checks if the inferred hierarchy from cell labels in obs matches the expected hierarchy from CAS rank data.

Parameters:
  • cas (Dict[str, Any]) – The CAS JSON object containing labelset ranks.

  • obs (pd.DataFrame) – The AnnData obs DataFrame.

Return type:

bool

Returns:

True if all inferred parent-child relationships from obs match those from cas, otherwise False.

cas.utils.validation_utils.infer_obs_cell_hierarchy(obs, cas_ranks)[source]

Infers a direct parent-child hierarchy between cell labels based on row co-occurrence.

This function analyzes the hierarchical relationships between cell labels by comparing their row indices in the obs DataFrame. A label is considered a child if its row indices are fully contained within another label’s indices. The closest (direct) parent is selected based on CAS-defined ranks.

Parameters:
  • obs (pd.DataFrame) – The AnnData obs DataFrame containing labelset columns.

  • cas_ranks (Dict[str, int]) – A dictionary mapping cell labels to their CAS-defined rank, where lower values indicate higher ranks.

Returns:

A dictionary mapping each cell label to its inferred direct parent. Labels without a parent are assigned None.

Return type:

Dict[Any, Optional[Any]]
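The containment idea described above can be sketched in plain Python. The sketch treats a label as a child of any label whose row indices contain its own, and picks as direct parent the containing label with the smallest absolute rank difference. That reading of "closest" is an assumption, as is ignoring labels of the same rank. A dict of columns stands in for obs, and `infer_parents` is a hypothetical name.

```python
from typing import Dict, List, Optional, Set


def infer_parents(
    columns: Dict[str, List[str]], ranks: Dict[str, int]
) -> Dict[str, Optional[str]]:
    """Infer each label's direct parent from row-index containment."""
    # Collect the row indices at which each label occurs.
    rows: Dict[str, Set[int]] = {}
    for values in columns.values():
        for index, label in enumerate(values):
            rows.setdefault(label, set()).add(index)

    parents: Dict[str, Optional[str]] = {}
    for child, child_rows in rows.items():
        # Candidate parents fully contain the child's rows and differ in rank.
        candidates = [
            other
            for other, other_rows in rows.items()
            if other != child
            and ranks[other] != ranks[child]
            and child_rows <= other_rows
        ]
        # Assumption: the direct parent is the candidate nearest in rank.
        parents[child] = min(
            candidates, key=lambda c: abs(ranks[c] - ranks[child]), default=None
        )
    return parents


# Made-up labels; each list is one obs column, aligned by row.
columns = {
    "subclass": ["A1", "A2", "B1", "B2"],
    "class": ["A", "A", "B", "B"],
    "neighborhood": ["N", "N", "N", "N"],
}
ranks = {"A1": 0, "A2": 0, "B1": 0, "B2": 0, "A": 1, "B": 1, "N": 2}
parents = infer_parents(columns, ranks)
```

Two labels that cover exactly the same rows at different ranks would each appear as the other's parent in this sketch, so real data may need an extra tie-breaking rule.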

cas.utils.validation_utils.infer_cas_cell_hierarchy(cas)[source]

Return type:

Dict[str, Optional[str]]

Parameters:

cas (Dict[str, Any])

Module contents