I can think of a few better ways to brush up on scraping data than going through OkCupid. Still, it made for a fun evening and let me brush up on regular expressions (which will be the death of me).

What did I use to grab the OkCupid data?


What should I have used to grab (and parse) the data?

Simple PHP DOM Library

When I started this, I realized a bit too late that there was no real reason to create a function just to grab the usernames. OkCupid’s search function makes that incredibly easy:

  1. Go to Browse ->
  2. Edit Get / Post variables in the URL
  3. Rejoice

Shit! OkCupid uses Ajax to show and hide divs on scroll. That’s annoying. How do we eliminate that? Oh yeah, we turn off JavaScript.







The variables in the URL answer the only questions that matter: how old are we looking for, when were they online, what are we ordering by, and how many results are we going to show? Ta da.
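Step 2 above (editing the GET variables) can be sketched with `http_build_query`. The parameter names and the base URL below are made up for illustration, since the real ones change; copy whatever shows up in your own address bar after running a search.

```php
<?php
// Hypothetical base URL and parameter names: read the real ones off the
// address bar after running a search in your browser.
function buildSearchUrl(string $base, array $params): string
{
    // http_build_query URL-encodes each value and joins the pairs with "&"
    return $base . '?' . http_build_query($params);
}

$searchUrl = buildSearchUrl('https://example.com/browse', [
    'age_min'     => 25,      // how old are we looking for
    'age_max'     => 35,
    'last_online' => 86400,   // when were they online (seconds, assumed unit)
    'order_by'    => 'LOGIN', // what are we ordering by
    'limit'       => 100,     // how many results to show
]);

echo $searchUrl, PHP_EOL;
```

Load the resulting URL with JavaScript turned off and save the page; that saved file is what the parsing code works on.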

I used to have all the code here written in proper OOP PHP, but things got weird. So now the code is rewritten into *thought* chunks in procedural PHP *shudder*. Not as pretty, but easier to follow for people who actually care to run through this rather than simply copy and paste. Just make sure to set the URL / USERNAME columns in your table to distinct (unique), because the matches come back with duplicate values. If you want to just loop through every second line, you can do that too… but whatevs.

$getcontents = 'download1.html'; // obviously set the name of the html file you downloaded
$data = file_get_contents($getcontents); // toss the file contents into $data
$regex = '/www\.okcupid\.com\/profile\/(.+?)\?cf=regular/'; // look for the profile URL, it's all we really want
preg_match_all($regex, $data, $match); // match it up.. full URLs land in $match[0], usernames in $match[1]
$url = array(); // We don't actually need this, but it's not a bad habit to get into
$username = array(); // and same deal, brah. Same deal.

/* all you have to do is loop
 * through the matches in the
 * code #amirite */
foreach ($match[0] as $urls) {
    $url[] = $urls;
}
foreach ($match[1] as $usernames) {
    $username[] = $usernames;
}

/* do we really need to loop this twice? Nope. But when you're teaching someone how to go through this
 * it's better to show them code they can understand.. So we're looping it twice
 * below is the same deal. Realistically this should all just be one function, but
 * it's much easier to follow and teach when you just rip it apart */
for ($i = 0; $i < count($url); $i++) {
    // Just toss it in...easy peezy
    $query = "INSERT INTO okcupidlinks (urls, username) VALUES ('" . $url[$i] . "', '" . $username[$i] . "')";
    $retval = mysqli_query($conn, $query); // we called our connection $conn btw
}
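If you would rather kill the duplicates before they ever reach the database, here is a rough sketch of one way to do it: key the results by username so repeats collapse on their own. The sample markup and the `extractProfiles` name are made up for illustration; the regex is the same one used above.

```php
<?php
// Made-up sample standing in for the saved search page. Each profile is
// linked more than once, which is where the duplicate values come from.
$data = '<a href="http://www.okcupid.com/profile/alice?cf=regular"><img></a>'
      . '<a href="http://www.okcupid.com/profile/alice?cf=regular">alice</a>'
      . '<a href="http://www.okcupid.com/profile/bob_42?cf=regular">bob_42</a>';

// Hypothetical helper name; returns username => URL with repeats removed.
function extractProfiles(string $html): array
{
    // Dots and the question mark are escaped so they only match literally.
    $regex = '/www\.okcupid\.com\/profile\/(.+?)\?cf=regular/';
    preg_match_all($regex, $html, $match, PREG_SET_ORDER);

    $profiles = [];
    foreach ($match as $m) {
        // $m[0] is the whole matched URL, $m[1] the captured username.
        // Writing to the same key twice simply overwrites the first copy.
        $profiles[$m[1]] = $m[0];
    }
    return $profiles;
}

print_r(extractProfiles($data));
```

One loop instead of two, and the insert loop afterwards only sees each profile once.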

Create another table (okcupidlinks2 below) with USERNAME, URLS and a large text column for the raw HTML, and you’re good to go:

$username = "blahblah"; //your okcupid username
$password = "blahblah"; // your okcupid password
$postinfo = "username=".$username."&password=".$password;
/* I have it with cookies.. you don't need cookies so it's been removed. If for some reason
you can't figure it out, I'll add the cookies back. */
$ch = curl_init();
curl_setopt($ch, CURLOPT_HEADER, false);
curl_setopt($ch, CURLOPT_NOBODY, false);
curl_setopt($ch, CURLOPT_URL, $url);
curl_setopt($ch, CURLOPT_SSL_VERIFYHOST, 0);
curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/32.0.1700.107 Safari/537.36");
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_SSL_VERIFYPEER, 0);
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 0);
curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "POST");
curl_setopt($ch, CURLOPT_POST, 1);
curl_setopt($ch, CURLOPT_POSTFIELDS, $postinfo);

/* database */
$dbhost = 'localhost';
$dbuser = 'blah';
$dbpass = 'blah';
$dbname = 'blah'; // whichever database holds the okcupidlinks tables
$conn = mysqli_connect($dbhost, $dbuser, $dbpass, $dbname, 3306); // mysqli takes the port as its own argument; 3306 is the MySQL default, change it if yours differs
/* however you choose to handle a failed connection, you can do so here */
/* my script grabs age and a few other things that are easy to pull out.. I haven't included them here */
$result = mysqli_query($conn, "SELECT urls, username FROM okcupidlinks ORDER BY urls DESC");
/* again.. however you choose to handle a fucked up result.. throw it here */

while ($row = mysqli_fetch_array($result, MYSQLI_NUM)) {
    /* i don't care how you fetch it...just fetch it. Maybe ASSOC is better in this case, but if you have
     * an array with two elements URL and USERNAME just stick to numbers. */
    curl_setopt($ch, CURLOPT_URL, trim($row[0]));
    $html = curl_exec($ch); // give 'er
    $prettyhtml = mysqli_real_escape_string($conn, $html); // the connection goes first
    $query = "INSERT INTO okcupidlinks2 (urls, username, hugehtml) VALUES ('" . $row[0] . "', '" . $row[1] . "', '" . $prettyhtml . "')";
    $retval = mysqli_query($conn, $query);
    sleep(3); // take a quick snooze, brah
}

Free the result with mysqli_free_result($result), and close cURL. curl_close($ch) is the command, believe it or not.

So that’s the basics of how to crawl OkCupid. You really don’t need to use a database, and if you can’t figure out how to merge these two chunks together so that it just does it on the fly, just comment or something. I haven’t written procedural PHP in a while, but it’s infinitely easier to follow for the layman. I almost certainly left out some semicolons and whatnot while rewriting this, so if it bugs out, it shouldn’t be that hard to figure out.
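For anyone stuck on merging the two chunks, here is one possible shape for it, with no database in the middle. The function names (`crawlProfiles`, `curlFetcher`) are made up, and the fetch step is passed in as a callable so you can try the loop without touching the network; treat it as a sketch rather than the code I originally ran.

```php
<?php
// $fetch is any callable that takes a URL and returns the page body.
function crawlProfiles(string $listingHtml, callable $fetch, int $delaySeconds = 3): array
{
    preg_match_all('/www\.okcupid\.com\/profile\/(.+?)\?cf=regular/', $listingHtml, $match);

    $pages = [];
    // array_unique keeps the index of each first occurrence, so $match[0][$i]
    // is still the URL that belongs to $username.
    foreach (array_unique($match[1]) as $i => $username) {
        if ($pages !== [] && $delaySeconds > 0) {
            sleep($delaySeconds); // same quick snooze as the database version
        }
        $pages[$username] = $fetch($match[0][$i]);
    }
    return $pages;
}

// A fetcher built on an already configured cURL handle, for real runs.
function curlFetcher($ch): callable
{
    return function (string $url) use ($ch) {
        curl_setopt($ch, CURLOPT_URL, $url);
        return curl_exec($ch);
    };
}

// Dry run with a fake fetcher and a made-up listing page.
$listing = '<a href="http://www.okcupid.com/profile/alice?cf=regular">a</a>'
         . '<a href="http://www.okcupid.com/profile/alice?cf=regular">a</a>'
         . '<a href="http://www.okcupid.com/profile/bob?cf=regular">b</a>';

$pages = crawlProfiles($listing, function ($url) {
    return "<html>$url</html>";
}, 0);

print_r(array_keys($pages));
```

For the real thing, swap the fake fetcher for `curlFetcher($ch)` and leave the delay at its default.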

This is just a jumping off point. Obviously using regex and DOM, it’s easy to grab basically any data that you want and store it in a database. Originally that was posted here, but I do not need any more weird emails.
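As a taste of the DOM route, here is a minimal sketch that pulls the profile links with PHP's built-in DOMDocument and DOMXPath instead of a regex. Note that this swaps in the bundled DOM extension rather than the Simple PHP DOM library mentioned at the top, and the sample markup and the `profileLinks` name are made up for illustration.

```php
<?php
// Made-up sample page; real markup will differ.
$html = '<html><body>'
      . '<a href="http://www.okcupid.com/profile/alice?cf=regular">alice</a>'
      . '<a href="http://www.okcupid.com/about">about</a>'
      . '<a href="http://www.okcupid.com/profile/bob_42?cf=regular">bob_42</a>'
      . '</body></html>';

// Hypothetical helper name; returns username => href.
function profileLinks(string $html): array
{
    $doc = new DOMDocument();
    // Scraped pages are rarely valid HTML, so collect parser warnings
    // quietly instead of printing them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $links = [];
    foreach ($xpath->query('//a[contains(@href, "/profile/")]') as $a) {
        $href = $a->getAttribute('href');
        // The username is the last path segment, before the query string.
        $username = basename(parse_url($href, PHP_URL_PATH));
        $links[$username] = $href;
    }
    return $links;
}

print_r(profileLinks($html));
```

The same XPath approach extends to other fields: find the element that wraps the value you want and read its text content.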

This just visits their profile, and basically invites them to send you a message.

Just make sure that no one on OkCupid has the username “=2; DROP TABLE OKCUPIDLINKS2”, or just use prepared statements.
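Since prepared statements are the actual answer, here is a small sketch of the insert done that way. To keep it runnable without a MySQL server it uses PDO with an in-memory SQLite database, which is my substitution and not what the script above uses; the table and column names are the ones from the insert query.

```php
<?php
// In-memory SQLite through PDO, purely so the example runs anywhere.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE okcupidlinks2 (urls TEXT, username TEXT, hugehtml TEXT)');

// Placeholders keep the values out of the SQL text entirely, so nothing
// in a username or in the page HTML can be read as SQL.
$stmt = $pdo->prepare(
    'INSERT INTO okcupidlinks2 (urls, username, hugehtml) VALUES (?, ?, ?)'
);

$nasty = '=2; DROP TABLE OKCUPIDLINKS2';
$stmt->execute(['www.okcupid.com/profile/x?cf=regular', $nasty, "<p>it's html</p>"]);

$count  = (int) $pdo->query('SELECT COUNT(*) FROM okcupidlinks2')->fetchColumn();
$stored = $pdo->query('SELECT username FROM okcupidlinks2')->fetchColumn();

echo $count, ' row, username stored as: ', $stored, PHP_EOL;
```

With mysqli the pattern is the same: `mysqli_prepare`, then `mysqli_stmt_bind_param` with an `'sss'` type string, then `mysqli_stmt_execute`. It also makes the `mysqli_real_escape_string` call unnecessary.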


2 Responses

  1. Kevin

    Hey thank you for your tutorial. I am trying to login on Okcupid using cURL with your code but I keep getting denied and get the signup page as a result after submiting the login form.
    Do you have any ideas for me?

    Best regards,

    • kris

      OkCupid is notorious for changing little things all of the time. I’ll take a look, but as far as I remember, you just grab the cookie and go from there. If you’re having trouble grabbing the cookie, there’s a browser plugin that’ll grab one for you.

