SensorKit: careful with duplicates in device usage data...

Daniel J Wilson

Just a general warning that there are lots of duplicated rows in the device usage data.

Usually what is happening is that the same `.json` file is uploaded on multiple days, so unless you check for this while ingesting the data you will (we certainly did) end up with lots of duplicates.

This is normally not an issue, as you can remove exact duplicate rows with pandas' `drop_duplicates()`, but we have also noticed "weird" duplicate rows that you will need to catch separately unless you want to incorrectly inflate your data.
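For the easy case, a minimal sketch (the column names and values here are made up for illustration, not taken from the actual SensorKit export):

```python
import pandas as pd

# Hypothetical example: the same upload appears twice after ingestion.
df = pd.DataFrame({
    "Participant_ID": ["p01", "p01", "p01"],
    "date_time": ["2023-05-01 18:30:00", "2023-05-01 18:30:00", "2023-05-01 18:45:00"],
    "total_unlock_duration": [120, 120, 45],
})

# Exact duplicates (every column identical) are easy to drop.
deduped = df.drop_duplicates()
print(len(deduped))  # 2
```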

For example, if you look at the data below 👇 you can see that something weird is happening with the `date_time` stamp... look carefully.

In row 131 the time is 18:44:59, and in row 132 it is 18:45:00. The duration for both is given as 900 seconds, but since the two samples are only 1 second apart, 899 of those seconds necessarily overlap. You can also see this in the data: the two rows have identical `total_unlock_duration` values.

Standard duplicate detection will not work on this as the rows DO have different `date_time` values.
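To illustrate why (with made-up values standing in for the real rows): the two rows differ in `date_time`, so whole-row comparison sees them as distinct.

```python
import pandas as pd

# Hypothetical rows 131 and 132: the same 900-second window reported twice,
# with date_time shifted by 1 second.
df = pd.DataFrame({
    "date_time": ["2023-05-01 18:44:59", "2023-05-01 18:45:00"],
    "duration": [900, 900],
    "total_unlock_duration": [312, 312],
})

# drop_duplicates() compares entire rows; the date_time values differ,
# so nothing is removed.
print(len(df.drop_duplicates()))  # 2
```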

Specifying which columns to check for duplicates while ignoring `date_time` is not a good solution either: many samples have all-zero values, so they are not actually duplicates, and even rows with positive values have a non-zero chance of matching by coincidence, so you would be incorrectly removing "real" data.

Our workaround for the moment is to sort the dataframe by `Participant_ID` and `date_time` and then use the `.diff()` method to create a new column containing the time difference between consecutive samples. If there are values in this column below 900 seconds (the standard interval), something is probably going on.
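A sketch of that step (again with hypothetical data and a made-up participant ID; `time_diff` matches the column name used below):

```python
import pandas as pd

# Hypothetical data: two samples only 1 second apart (the "weird" duplicates),
# plus normal 900-second spacing.
df = pd.DataFrame({
    "Participant_ID": ["p01"] * 4,
    "date_time": pd.to_datetime([
        "2023-05-01 18:29:59",
        "2023-05-01 18:44:59",
        "2023-05-01 18:45:00",  # only 1 second after the previous sample
        "2023-05-01 19:00:00",
    ]),
})

df = df.sort_values(["Participant_ID", "date_time"]).reset_index(drop=True)

# Gap to the previous sample, converted from timedelta to seconds
# (NaN for the first row, which has no predecessor).
df["time_diff"] = df["date_time"].diff().dt.total_seconds()

print(df["time_diff"].tolist())  # [nan, 900.0, 1.0, 900.0]
```

With per-participant data concatenated, a plain `.diff()` also produces a large spurious value at each participant boundary; `df.groupby("Participant_ID")["date_time"].diff()` would avoid that, at the cost of a slightly more involved expression.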

Doing a `value_counts()` on this column we got:

So the vast majority were 900 seconds, and there were also a significant number of 899- and 901-second diffs, which are not a big deal, but there were almost 2,000 one-second diffs, which need to be removed. The large values occur when diff-ing between one participant and the next.

To then remove these values you can choose a minimum gap in seconds (in this case I chose 30, but the difference between 30 and 400, for example, is only six samples, as almost all the heavy lifting is done by the 1-second value) and extract all of the relevant indices for removal (`time_diff` is the name of my column, which I converted from a timedelta object to seconds):
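Something along these lines (the threshold name and the example values are mine, not from the original data):

```python
import pandas as pd

# Hypothetical frame with the time_diff column already computed, in seconds.
df = pd.DataFrame({
    "Participant_ID": ["p01"] * 4,
    "time_diff": [float("nan"), 900.0, 1.0, 900.0],
})

MIN_GAP_SECONDS = 30  # gaps below this are treated as overlapping samples

# Indices of rows that arrive too soon after the previous sample.
# NaN < 30 is False, so the first row of each participant is kept.
drop_idx = df.index[df["time_diff"] < MIN_GAP_SECONDS]
df = df.drop(drop_idx)

print(len(df))  # 3
```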

Hopefully this might be helpful for someone - and if anyone has a better solution I am definitely all ears!

Also just a thought for the forum developers that it would be great if we could write in markdown here :) 



