SensorKit Ingestion & Processing

Basyl Durnan

How have you been ingesting and processing SensorKit data? Do you have any advice for people just approaching SensorKit data for the first time?

Please remember our community guidelines:

  1. All posts and comments must be constructive. We ask that you respect and assist one another and always add your thoughts with the goal of improving the product and/or community.

  2. All posts and comments must be relevant. Posts that are not related to MyDataHelps or healthcare may be removed.

Comments

4 comments

  • Comment author
    Yfang

    Hi,

    Here's some feedback on the data formatting from the HITS team, which maintains the data intake pipeline:

    To ease our implementation and integrate SensorKit data ingestion with our established data pipeline, we would like to ask for your help with the SensorKit data format:

    1. After unzipping the daily zip file (*.zip), the SensorKit JSON files are still individually gzipped (.gz). Could you include the original JSON files directly, as you do for the Fitbit intraday data files, i.e., without gzipping each SensorKit JSON file?
    2. If item 1 is not possible, could you prefix each JSON.gz data file name with the participant’s MDH ID? That way, I could extract a participant’s MDH ID from the file name after un-gzipping all the SensorKit GZ files for each type of data. We could then combine the MDH ID with other data elements extracted from the JSON file to form tabular data for analysis.
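
    In the meantime, the per-file gzip layer can be removed locally once the daily archive has been unzipped. A minimal sketch, assuming the export is already unzipped into a folder; `gunzip_tree` is a hypothetical helper name, and only the Python standard library is used:

```python
import gzip
import shutil
from pathlib import Path


def gunzip_tree(export_root):
    """Write a decompressed copy next to every .gz file under export_root."""
    written = []
    for gz_path in sorted(Path(export_root).rglob("*.gz")):
        # "data.json.gz" -> "data.json"; the original .gz file is left in place.
        out_path = gz_path.with_suffix("")
        with gzip.open(gz_path, "rb") as src, open(out_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
        written.append(out_path)
    return written
```

    Calling `gunzip_tree` on the unzipped export folder returns the list of decompressed paths and leaves the directory layout unchanged, so any identifiers encoded in the folder names are still available afterwards.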

    Thanks,

    Intern Health Study team

  • Comment author
    Basyl Durnan

    Hi Yu, thanks for sharing. How has your team been ingesting the SensorKit data to date? Is that still being explored?

  • Comment author
    Yfang

    Hi, we have been able to explore the available metrics by reading individual JSON files and Apple's help documents (thanks for the link!).

    However, resolving the formatting issue above is essential for batch processing the data in the future. Thanks!

  • Comment author
    Eric Schramm

    The SensorKit data is packaged in a fairly raw format, and for some of the more "data verbose" types the data is chunked into a series of files. Working with it manually can be daunting simply because of the volume and arrangement. Here's an example Python script that, given the path to an unzipped MyDataHelps export directory (set in the data_path variable inside the script), loops through the files, un-gzips them, and combines them by participant identifier into one file per participant. Each file is a large JSON object with the data nested by sensor type, device type, device identifier, and query interval.

    import os
    import gzip
    import json

    # When pointed at the root of an export, this script will ungzip each file and
    # append it into a data structure for each participant roughly like this, with a
    # file per participant named {participant_id}.json in the data_path directory.
    #
    # {
    #   {sensor_type}: {
    #     {device_type}: {
    #       {device_identifier}: {
    #         {query_interval_file_1}: {contents of file 1...},
    #         {query_interval_file_2}: {contents of file 2...},
    #       },
    #     },
    #   },
    # }

    # Set the path to the data export folder
    data_path = "/downloads/MyDataHelps/2022-12/"

    # Create a dictionary to store the extracted data by participant ID
    data_by_participant = {}

    # Iterate over all files in the data export folder
    for root, dirs, files in os.walk(data_path):
        for file in files:

            # Check if the file is a gzip file
            if file.endswith(".gz"):

                # Split the directory path into its components (portable across OSes)
                path_components = os.path.normpath(root).split(os.sep)

                # Get participant ID, sample type, device type, device identifier, query interval
                participant_ID = path_components[-2]
                device_identifier = path_components[-1]
                device_type = path_components[-3]
                sample_type = path_components[-4]
                query_interval = file.split(".")[0]

                # Walk down the nested dictionaries, creating each level on first use:
                # participant -> sample type -> device type -> device identifier
                participant_data = data_by_participant.setdefault(participant_ID, {})
                sample_type_data = participant_data.setdefault(sample_type, {})
                device_type_data = sample_type_data.setdefault(device_type, {})
                device_identifier_data = device_type_data.setdefault(device_identifier, {})
        
                # Open the gzip file and parse its JSON contents
                with gzip.open(os.path.join(root, file), "rb") as gzip_file:
                    device_identifier_data[query_interval] = json.load(gzip_file)
                print(f"processing {sample_type} data for {participant_ID} - device: {device_type}, identifier: {device_identifier}; query range: {query_interval}")

    # Loop over participants and write one JSON file per participant
    for participant_ID, participant_data in data_by_participant.items():
        print("saving data for " + participant_ID)
        # os.path.join works whether or not data_path ends with a separator
        file_name = os.path.join(data_path, participant_ID + ".json")
        with open(file_name, "w") as outfile:
            json.dump(participant_data, outfile)
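
    For tabular analysis, the nested per-participant structure can be flattened into one row per query-interval file, with the participant ID attached to every row. A minimal sketch, assuming the {participant_id}.json files written by the script above; `iter_rows` and `load_rows` are hypothetical helper names, and the payload is kept opaque because its schema is not described here:

```python
import json
from pathlib import Path


def iter_rows(participant_id, participant_data):
    """Yield one flat dict per query-interval file in the nested structure."""
    for sensor_type, by_device_type in participant_data.items():
        for device_type, by_identifier in by_device_type.items():
            for device_identifier, by_interval in by_identifier.items():
                for query_interval, payload in by_interval.items():
                    yield {
                        "participant_id": participant_id,
                        "sensor_type": sensor_type,
                        "device_type": device_type,
                        "device_identifier": device_identifier,
                        "query_interval": query_interval,
                        # Assumption: the payload is passed through untouched.
                        "payload": payload,
                    }


def load_rows(json_path):
    """Read one {participant_id}.json file and return its flattened rows."""
    json_path = Path(json_path)
    with json_path.open() as infile:
        participant_data = json.load(infile)
    # The file stem is the participant ID, matching the naming used above.
    return list(iter_rows(json_path.stem, participant_data))
```

    The rows from `load_rows` can then be expanded further per sensor type once the fields of interest in each payload are known.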