TL;DR

In our quest to fetch a whole thread of tweets via the Twitter API, at this stop we get all tweets that follow a specific one.

You already know it: we know how to Scrape a Thread of Tweets but we want to do the same using the Twitter API. It’s not difficult but also not totally straightforward, so let’s start with a simpler problem: getting all tweets that were posted by a specific user, starting from a tweet of our choosing.

We start with some boilerplate

Well, we can just copy some code from our latest post about using MojoX::Twitter and adapt it a bit for starters:

#!/usr/bin/env perl
use 5.024;
use warnings;
use experimental qw< postderef signatures >;
no warnings qw< experimental::postderef experimental::signatures >;

use Mojo::JSON 'j';
use Mojo::File 'path';
use MojoX::Twitter;

my $id = shift // '1215710451343904768';

my $credentials = j path('twitter-credentials.json')->slurp;
my $client      = MojoX::Twitter->new(
   consumer_key        => $credentials->{'api-key'},
   consumer_secret     => $credentials->{'api-secret-key'},
   access_token        => $credentials->{'access-token'},
   access_token_secret => $credentials->{'access-token-secret'},
);
my $tweets = get_tweets_since($client, $id);
say j $tweets;

sub get_tweets_since ($client, $id) {...}

It’s pretty basic: the starting tweet’s id can be optionally provided on the command line, defaulting to the one that got all of this started (see Scrape a Thread of Tweets).

The client object is created exactly as in Getting started with MojoX::Twitter, although this time we delegate the job of retrieving all tweets since that specific identifier, including that very tweet, to a sub (get_tweets_since). After we get the result, we turn it into a JSON string (via function j from Mojo::JSON) and print it (via say).
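Since the script reads four keys from twitter-credentials.json, a quick sanity check before building the client can spare us some head-scratching. What follows is only a minimal sketch and not part of the original script: the helper name missing_credentials is made up for illustration, and it uses the core JSON::PP module instead of Mojo::JSON just to stay free of dependencies.

```perl
#!/usr/bin/env perl
use 5.024;
use warnings;
use experimental qw< signatures >;
use JSON::PP 'decode_json';

# The four keys that the script above reads from the credentials file.
my @required = qw< api-key api-secret-key access-token access-token-secret >;

# Hypothetical helper: returns the required keys that are missing (or
# empty) in the decoded credentials hash.
sub missing_credentials ($credentials) {
   return grep { !length($credentials->{$_} // '') } @required;
}

# Placeholder values only, the real ones come from your developer account.
my $sample = decode_json(
   '{"api-key":"xxx","api-secret-key":"yyy","access-token":"zzz"}');
my @missing = missing_credentials($sample);
say "missing keys: @missing" if @missing;    # access-token-secret here
```

In the real script the same check could be applied to $credentials right after slurping the file.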

Getting all tweets

The Twitter API has a (remote) method to get all tweets in a user’s timeline, i.e. GET statuses/user_timeline. How to work properly with timelines is explained in Get Tweet timelines, which is an interesting read.

The bottom line is that:

  • you start by getting the latest tweets in the user’s timeline, going backwards in time and getting at most 200 tweets per request;
  • to go back in time, you provide parameters to put a boundary;
  • when you hit the identifier of the original tweet you stop.

The two key parameters to do this windowing are since_id and max_id.

The former is probably the easier to understand: since_id tells the API to only include tweets that came strictly after the specific identifier. In our quest for a thread this is good, because for sure there are no interesting tweets in a thread before the initial tweet!

The max_id parameter requires some care. As we anticipated, the Twitter API works backwards, so we can use max_id to set an upper boundary on the tweets we are interested in (i.e. we don’t want anything strictly after max_id).
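To make the two boundaries concrete, here is a toy model of the windowing, working on plain numbers instead of real tweets. It is a sketch under assumptions: fake_timeline is a made-up function that only mimics what the two parameters select (since_id exclusive, max_id inclusive, newest first), and its default for count is arbitrary.

```perl
#!/usr/bin/env perl
use 5.024;
use warnings;
use experimental qw< signatures >;

# Toy stand-in for the remote endpoint: keeps the ids that fall inside the
# window and returns them newest first, at most "count" of them.
sub fake_timeline ($ids, %opt) {
   my @hits = grep {
      (!defined $opt{since_id} || $_ > $opt{since_id})     # strictly after
        && (!defined $opt{max_id} || $_ <= $opt{max_id})   # up to, inclusive
   } $ids->@*;
   @hits = sort { $b <=> $a } @hits;
   my $count = $opt{count} // 20;    # arbitrary default for this toy
   splice @hits, $count if @hits > $count;
   return \@hits;
}

my @ids = (10, 20, 30, 40, 50, 60);
my $window = fake_timeline(\@ids, since_id => 20, max_id => 50, count => 200);
say "@$window";    # prints "50 40 30"
```

Note how 20 is left out while 50 is kept: the lower boundary is exclusive, the upper one is inclusive.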

Let’s see some code:

 1 sub get_tweets_since ($client, $id) {
 2    my $tweet = $client->request(    # needed to get the user
 3       GET => "statuses/show/$id",
 4       {tweet_mode => 'extended'}
 5    );
 6    my @tweets;
 7    my %options = (
 8       user_id    => $tweet->{user}{id},
 9       since_id   => $id,
10       count      => 200,                  # max value possible
11       tweet_mode => 'extended',
12    );
13    while ('necessary') {
14       my $chunk =
15       $client->request(GET => 'statuses/user_timeline', \%options);
16       my @chunk = sort { $a->{id} <=> $b->{id} } $chunk->@*;
17       pop @chunk if exists $options{max_id};    # remove duplicate
18       last unless @chunk;                       # no more available
19       $options{max_id} = $chunk[0]{id};         # remark for next iteration
20       unshift @tweets, @chunk;                  # older ones in front
21    } ## end while ('necessary')
22    unshift @tweets, $tweet;                     # the starting one...
23    return \@tweets;
24 } ## end sub get_tweets_since

First of all, we have to get the initial tweet. This is necessary because it also allows us to find out which user posted the tweet, so that we can go through the user_timeline associated with that user’s identifier.

Hash %options contains all parameters that we will pass in our successive calls to the statuses/user_timeline endpoint. In the first iteration we set only since_id (i.e. the lower bound) but not max_id, which means that we will get the most recent available tweets.

The result of the call is a reference to an anonymous array that we “store” as $chunk. Immediately after, we unroll it into array @chunk, sorting by id on the fly. It’s not entirely clear whether this sorting is really needed or not, but let’s do it anyway to be on the safe side.

One tricky thing about max_id is that the tweet with max_id is included in the result. We set it to the lowest identifier found in the previous iteration (line 19), so whenever %options contains it, the corresponding tweet comes back as a duplicate. This accounts for line 17 where we pop it away, which happens only starting from the second iteration, because $options{max_id} is not set yet when the test in line 17 is performed during the first iteration.
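As an aside, Twitter’s guide on timelines suggests another way to avoid the duplicate: ask for max_id set to the lowest identifier seen so far minus one. The sketch below shows just that subtraction; next_max_id is a hypothetical helper, and it goes through the core Math::BigInt module on the string form of the identifier so that it does not depend on how Perl was compiled to handle 64-bit integers.

```perl
#!/usr/bin/env perl
use 5.024;
use warnings;
use experimental qw< signatures >;
use Math::BigInt;

# Hypothetical helper: given the lowest id seen so far (ideally as a
# string, e.g. the id_str field of a tweet), return the next max_id to
# ask for, again as a string.
sub next_max_id ($lowest_id) {
   return Math::BigInt->new("$lowest_id")->bsub(1)->bstr;
}

say next_max_id('1215710451343904768');    # prints 1215710451343904767
```

Assuming this approach, line 19 would set $options{max_id} from next_max_id($chunk[0]{id_str}) and the pop in line 17 should no longer be needed. The rest of this post sticks with the pop-the-duplicate version.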

If @chunk is empty after removing the duplicate tweet, then our backwards iteration has come to an end and we can stop the loop (line 18).

Tweets in @chunk are sorted from the oldest to the most recent, while chunks arrive from the most recent to the oldest. For this reason we put each chunk in front of the overall array @tweets, using unshift in line 20.

Last, remember that since_id selects only tweets strictly after it? If we are interested in the starting tweet too, we have to add it explicitly as the first item in @tweets, which we do in line 22.
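If you want to try the loop without credentials (or without hitting rate limits), one option is to run it against a fake client that serves tweets from memory. The following is only a sketch: FakeClient is a made-up class that imitates the two requests we use, tweets are reduced to an id and a user, and count is lowered to 3 to force several iterations.

```perl
#!/usr/bin/env perl
use 5.024;
use warnings;
use experimental qw< signatures >;

# Hypothetical stand-in for MojoX::Twitter, serving tweets from memory.
package FakeClient {

   sub new ($class, @ids) {
      my @tweets = map { +{id => $_, user => {id => 42}} } @ids;
      return bless {tweets => \@tweets}, $class;
   }

   sub request ($self, $method, $endpoint, $opt = {}) {
      if (my ($id) = $endpoint =~ m{\A statuses/show/ (\d+) \z}x) {
         my ($tweet) = grep { $_->{id} == $id } $self->{tweets}->@*;
         return $tweet;
      }
      # Anything else is treated as statuses/user_timeline:
      # since_id is exclusive, max_id is inclusive, newest first.
      my @hits = grep {
         $_->{id} > $opt->{since_id}
           && (!exists $opt->{max_id} || $_->{id} <= $opt->{max_id})
      } $self->{tweets}->@*;
      @hits = sort { $b->{id} <=> $a->{id} } @hits;
      splice @hits, $opt->{count} if @hits > $opt->{count};
      return \@hits;
   }
}

# Same logic as the function discussed above, with a tiny page size.
sub get_tweets_since ($client, $id) {
   my $tweet = $client->request(GET => "statuses/show/$id");
   my @tweets;
   my %options = (user_id => $tweet->{user}{id}, since_id => $id, count => 3);
   while ('necessary') {
      my $chunk = $client->request(GET => 'statuses/user_timeline', \%options);
      my @chunk = sort { $a->{id} <=> $b->{id} } $chunk->@*;
      pop @chunk if exists $options{max_id};    # drop the inclusive duplicate
      last unless @chunk;                       # nothing new, we are done
      $options{max_id} = $chunk[0]{id};         # upper bound for next round
      unshift @tweets, @chunk;                  # older ones in front
   }
   unshift @tweets, $tweet;                     # the starting one
   return \@tweets;
}

my $client = FakeClient->new(1 .. 10);
my @got = map { $_->{id} } get_tweets_since($client, 4)->@*;
say "@got";    # prints "4 5 6 7 8 9 10"
```

The output has no duplicates and no holes, and tweets older than the starting one (1 to 3 in this example) are left out.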

Putting it all together

The following snippet contains the whole code:

There is also a local version if the above snippet from GitLab is not working.

If you have a comment, please leave it below. Until next time, happy hacking!