The SIMS Backchannel Public Dataset

PRERELEASE DATA!
PLEASE DO NOT REDISTRIBUTE OR SHARE.

About

Public release of the dataset used in for log file analysis in "Social Uses of Communication Backchannels in a Shared Physical Space" by Rothenberg and King (2006).

Licensing

Pending legal advisement. Intention is free for academic usage with attribution.

Database Files

public_logs

Description: Anonymized chat logs for the SIMS Backchannel from October 2004 to March 2006.
Version: 1.1
Size: 276,491 records.
Download:
  • [SQL], SQL format, 7.2MB download, 46MB uncompressed
  • [CSV], CSV format, 6.9MB download, 28MB uncompressed
  • [XML], XML format, 7.2MB download, 111MB uncompressed
    field type comment
    id INT ID for log entry (sequential)
    ts DATETIME timestamp
    uuid INT unique userid of user for entry
    msg_hash VARBINARY(40) a salted SHA1 hash of the original message
    msg_length INT length of the original message (in characters)
    bot_cmd BOOL did msg begin with a "@" symbol?
    ends_question BOOL did msg end with a "?" symbol?
    ends_exclamation BOOL did msg end with a "!" symbol?
    ends_period BOOL did msg end with a "." symbol?
    contains_url BOOL did msg contain a "http://" string?

    public_users

    Description: Table of UserIDs built from public_logs, with manual coding added
    Version: 1.0
    Size: 335 records
    Download:
  • [SQL], SQL format, 12KB download, 65KB uncompressed
  • [CSV], CSV format, 11KB download, 40KB uncompressed
  • [XML], XML format, 12KB download, 129KB uncompressed
    field type comment
    uuid INT ID for log entry (sequential)
    uid INT If user is not a primary nick, this field is set equal to the UUID of the primary nick for the account (based upon manual coding)
    nickhash VARBINARY(40) a salted SHA1 hash of the original user nick
    time_first DATETIME timestamp of first msg
    time_last DATETIME timestamp of first msg
    tenure INT the user's tenure (amount of days in between time_first and time_last)
    participation INT amount of msgs sent by this UUID
    isjunk BOOL Was user classified as a "junk" user (inactive) in the R&K study?
    isdupe BOOL Was uuid classified as a "duplicate" user (not primary account) in the R&K study?
    isbot BOOL Was uuid classified as a "bot" in the R&K study?

    Examples

    Som example queries on the dataset to get you started, shown in MySQL.

    How many distinct users in a given day?

    SELECT date(ts), count(distinct p.uuid) AS amount_unique_users
    FROM public_logs p, public_users n 
    WHERE (p.uuid = n.uuid) AND
    isjunk = 0 AND isdupe = 0
    GROUP BY date(ts)
    ORDER BY count(distinct p.uuid) DESC
    LIMIT 5;
    
    +------------+---------------------+
    | date(ts)   | amount_unique_users |
    +------------+---------------------+
    | 2005-10-31 |                  24 |
    | 2005-11-07 |                  23 |
    | 2005-11-14 |                  23 |
    | 2005-10-26 |                  23 |
    | 2006-02-23 |                  22 |
    +------------+---------------------+
    

    On what day of week are the most questions asked?

    SELECT dayname(ts), count(msg_hash)
    FROM public_logs 
    WHERE ends_question = TRUE
    GROUP BY dayofweek(ts);
    
    +-------------+-----------------+
    | dayname(ts) | count(msg_hash) |
    +-------------+-----------------+
    | Sunday      |            3533 |
    | Monday      |            5712 |
    | Tuesday     |            6279 |
    | Wednesday   |            4871 |
    | Thursday    |            5388 |
    | Friday      |            2192 |
    | Saturday    |            2961 |
    +-------------+-----------------+
    

    More complex statistical analysis

    Here's a fancy complicated SQL query that was actually used in the analysis for our paper! This was to answer the question "is there more link sharing during the week when people are in class?"
    SELECT *, (urls / msgs)*100 AS urlpercent FROM
    (
    	SELECT dayname(ts) AS day, 
    	DAYOFWEEK(ts) in (1,7) AS weekend,
    	COUNT(msg_hash) AS msgs,
    	SUM(contains_url) AS urls
    	FROM public_logs
    	GROUP BY dayofweek(ts)
    ) AS sub;
    
    +-----------+---------+-------+------+------------+
    | day       | weekend | msgs  | urls | urlpercent |
    +-----------+---------+-------+------+------------+
    | Sunday    | 1       | 31983 | 424  | 1.3257     |
    | Monday    | 0       | 49634 | 759  | 1.5292     |
    | Tuesday   | 0       | 56182 | 903  | 1.6073     |
    | Wednesday | 0       | 43010 | 655  | 1.5229     |
    | Thursday  | 0       | 47342 | 708  | 1.4955     |
    | Friday    | 0       | 20686 | 299  | 1.4454     |
    | Saturday  | 1       | 27654 | 372  | 1.3452     |
    +-----------+---------+-------+------+------------+
    7 rows in set (1.29 sec)
    
    Is the difference significant? If you import these numbers into a stats package (I highly recommend GNU R), you'll see link sharing is 13.8% greater during the week, and is statistically significant at p =~ 0.001. Here's an example of how to test that in R:
    > mean(urlpercent[weekend==0]) / mean(urlpercent[weekend==1])
    [1] 1.138238
    > t.test(urlpercent[weekend==0], urlpercent[weekend==1])
    
            Welch Two Sample t-test
    
    data:  urlpercent[weekend == 0] and urlpercent[weekend == 1]
    t = 6.5721, df = 4.81, p-value = 0.001418
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     0.1115354 0.2576846
    sample estimates:
    mean of x mean of y
      1.52006   1.33545