Querying against the API endpoints

jgarcia · April 12, 2024, 8:07am

Hello! I was hoping to split the data by the year the samples were collected, similar to how the train/test TSV files are split. Is this possible with the API endpoints?

For instance, if I wanted to grab the training data from plasma_ab_titer, I could do something like:

https://www.cmi-pb.org/api/v4_1/plasma_ab_titer?select=*,specimen!inner(subject(dataset))&specimen.subject.dataset=eq.2020_dataset

I am fine with doing this sort of logic at the application level, but I was wondering whether similar functionality would be supported.
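For anyone who wants to script this, here is a minimal sketch that rebuilds the query URL above for any dataset name. The helper name `titer_url_for_dataset` is hypothetical, and the sketch only constructs the string; it does not check whether the server accepts the embedded filter.

```python
# Base URL and API version are taken from the query quoted above.
BASE_URL = "https://www.cmi-pb.org/api/v4_1"


def titer_url_for_dataset(dataset: str) -> str:
    """Rebuild the plasma_ab_titer query from this post for a given dataset name.

    `dataset` is a value such as "2020_dataset" (hypothetical helper, not part of the API).
    """
    select = "*,specimen!inner(subject(dataset))"
    return (
        f"{BASE_URL}/plasma_ab_titer"
        f"?select={select}"
        f"&specimen.subject.dataset=eq.{dataset}"
    )


train_url = titer_url_for_dataset("2020_dataset")
print(train_url)
```

The resulting string can be passed to any HTTP client, for example `requests.get(train_url, headers={"Accept": "text/csv"})`.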

Pramod · April 12, 2024, 6:19pm

Hi @jgarcia ,

Thank you for your inquiry about splitting data directly through the API. We are currently reviewing the capabilities of our API to ensure it can support this functionality effectively. We will get back to you soon with more information.

Also, please note that API resource embedding queries are intended for use with smaller tables. With large tables, such as RNASeq data, they are likely to fail, because transferring that much data through a browser can be challenging.
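One way to keep each response small is to page through a table instead of requesting it in one go. The sketch below assumes the API accepts PostgREST-style `limit` and `offset` query parameters, which the filter syntax used in this thread suggests but which has not been confirmed here. The names `http_get_text` and `iter_rows` are hypothetical.

```python
import csv
from io import StringIO
from urllib.request import Request, urlopen


def http_get_text(url: str) -> str:
    """Fetch a URL and return the body as text, asking for CSV output."""
    req = Request(url, headers={"Accept": "text/csv"})
    with urlopen(req) as resp:
        return resp.read().decode("utf-8")


def iter_rows(table_url: str, page_size: int = 1000, fetch=http_get_text):
    """Yield rows (as dicts) from a table endpoint, one page at a time.

    ASSUMPTION: the server honours `limit` and `offset` query parameters.
    Stops when a page comes back with fewer rows than `page_size`.
    """
    sep = "&" if "?" in table_url else "?"
    offset = 0
    while True:
        page_url = f"{table_url}{sep}limit={page_size}&offset={offset}"
        rows = list(csv.DictReader(StringIO(fetch(page_url))))
        yield from rows
        if len(rows) < page_size:
            break
        offset += page_size
```

Usage would look like `for row in iter_rows(f"{base_url}/some_table?some_column=eq.value"): ...`, where the table and filter are placeholders. Adding an `order=` parameter on a unique column is usually advisable so that pages do not overlap, again assuming the server supports it.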

Best,
Pramod

jgarcia · April 14, 2024, 11:34pm

That makes a lot of sense! Right now, we are hitting the API with the following endpoint to grab the data necessary for our task:

/pbmc_gene_expression?versioned_ensembl_gene_id=eq.ENSG00000277632.1

Splitting the subject table by year is easy, since the dataset year exists as a table column. I think that’s good enough, since we can use the values from the following query to split the data.

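import requests
from io import StringIO
# base_url, base_headers (Accept: text/csv) and year are defined elsewhere
subject_endpoint = f'{base_url}/subject?dataset=eq.{year}_dataset'
response = requests.get(subject_endpoint, headers=base_headers)
data = StringIO(response.text)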

Essentially, we can find which specimens/subjects belong to which year, and use those values to split everything on our end using Python.
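A rough sketch of that client-side split, using only the standard library. The column name `subject_id` and the tiny CSV strings are made up for illustration; in practice the text would come from the subject query above and from whichever table is being split.

```python
import csv
from io import StringIO


def ids_from_csv(csv_text: str, column: str) -> set:
    """Collect the distinct values of one column from CSV text."""
    return {row[column] for row in csv.DictReader(StringIO(csv_text))}


def split_by_dataset(rows, subject_ids_by_year, key="subject_id"):
    """Group rows by the year whose subject-ID set contains row[key].

    Rows whose ID is not found in any year are dropped.
    """
    groups = {year: [] for year in subject_ids_by_year}
    for row in rows:
        for year, ids in subject_ids_by_year.items():
            if row[key] in ids:
                groups[year].append(row)
                break
    return groups


# Made-up stand-ins for the CSV returned by the per-year subject queries.
subjects_2020 = "subject_id,dataset\n1,2020_dataset\n2,2020_dataset\n"
subjects_2021 = "subject_id,dataset\n3,2021_dataset\n"
subject_ids_by_year = {
    2020: ids_from_csv(subjects_2020, "subject_id"),
    2021: ids_from_csv(subjects_2021, "subject_id"),
}

# Made-up rows from some other table that carries a subject_id column.
other_table = "specimen_id,subject_id\n10,1\n11,3\n12,2\n"
rows = list(csv.DictReader(StringIO(other_table)))
by_year = split_by_dataset(rows, subject_ids_by_year)
```

Tables that only carry a specimen ID would need one more lookup from specimen to subject before the same grouping applies.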

Pramod · April 16, 2024, 9:45pm

Hi @jgarcia,

We have updated the API endpoints, and they now support resource embedding for join queries. Here is a screenshot of the query you posted.

Feel free to utilize this functionality, and let us know if you need additional support.

jgarcia · April 17, 2024, 7:38am

Wow, great! I got away with writing something to do that on our end, since I wasn’t sure when you folks would get to that! Here’s something I wrote up that attempts to do something similar on our end:

github.com/brianrqian/DSE260A-Immune-Response/blob/main/api_requests.py

import requests
import csv
from io import StringIO
import pandas as pd
import os

api_version = "4_1"  # latest version, should be updated as API changes
base_url = f"https://www.cmi-pb.org/api/v{api_version}/"
base_headers = {
    "Accept": "text/csv",
}


def create_subdirectory(subdirectory="data"):
    """Create a subdirectory if it doesn't exist."""
    directory_path = os.path.join(os.getcwd(), subdirectory)

    if not os.path.exists(directory_path):
        os.makedirs(directory_path)
        print(f"Created directory: {directory_path}")

If this is baked in directly on your end, I would prefer to rely on that from you folks.

Thanks for communicating that out.

jgarcia · April 25, 2024, 1:31am

Thanks for your support! My team was curious when the non-day zero values for 2022 would be released. I figured that’s part of the next prediction challenge, but we just wanted to verify that! Thanks again for your responses!

Pramod · May 2, 2024, 6:34pm

@jgarcia Longitudinal data for 2022 have already been made available. You can access them via the API (/v4_1/). Please let me know if you come across any issues. Also, let me know if you are looking into a specific data type or assay.