
X web automation and scraping with Selenium

For small Selenium projects we do not prevent automation detection but try to mimic human behaviour.

11 August 2023 Updated 11 August 2023

When you want to scrape data from the Web, you have to know what you're doing. You don't want to overload a target server with requests. If you send all your requests from a single IP address, you could get a (temporary) ban.

If you want to scrape big, consider using a dedicated service such as ZenRows, ScrapFly, WebScrapingAPI, ScrapingAnt, etc. They distribute your requests across a lot of systems, each with a unique IP address, which means the target server will think it's being accessed by many different (human) clients.

But hey, sometimes we just want to scrape a little bit of data, let's say we want to scrape a few people's new posts on a social media platform like X every day.

In this post, we'll use Selenium. There are alternatives like Puppeteer and Playwright, but Selenium has been around since 2004, is one of the most widely used tools for automated testing, and has a large community.

I will start with some considerations and end with some working code. The code automates logging in to X, and then searching for and scraping some posts. I know everything can change tomorrow, but today this works.

As always I am running Ubuntu 22.04.

Scraping from X with Selenium

Recently X decided that you must log in to view posts. I do not like this. Most of the time I am only reading posts of a few people, using the browser extension 'Breakthrough Twitter Login Wall'. This does not work anymore.

And sometimes I scrape some data, but not hundreds of thousands of tweets. I think many developers do this. There are almost 30 million software engineers in the world. Now let us assume that one in a thousand of them scrapes X posts once a year, an average of 1000 posts each, for small projects, testing and fun (learning). That is 30,000 scrapers per year, or roughly 100 scrapers per day, which comes to about 100,000 posts per day. This is not much compared to the millions of posts published every day. X still has a free API account, but it is write-only. If we want to scrape data, we can only use the X website and access it with tools like Selenium.
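As a quick sanity check, the estimate can be written out. All inputs are the rough guesses from the paragraph above, not measured data, and the variable names are only for illustration. The exact numbers come out a bit lower than the rounded ones, but the order of magnitude is the same.

```python
# Back-of-the-envelope estimate; every input is a rough guess from the text.
software_engineers = 30_000_000
one_in = 1000                 # one in a thousand scrapes X once a year
posts_per_scraper = 1000

scrapers_per_year = software_engineers // one_in
scrapers_per_day = scrapers_per_year / 365
posts_per_day = scrapers_per_day * posts_per_scraper

print(scrapers_per_year)        # 30000
print(round(scrapers_per_day))  # 82
print(round(posts_per_day))     # 82192
```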

Selenium is a very powerful tool for automated testing but it also can be used for web scraping. It was released in 2004 and has a large community. Other popular tools you might want to consider are Puppeteer and Playwright.

Mimicking Human Behavior

I would say it is not illegal to scrape a web page as long as you do not release the scraped data in a recognizable form. That said, I can imagine some companies blocking access to their web pages and services for automated software.

What exactly is automated software and how can it be detected? The most obvious check is to determine if a visitor's behavior differs from human behavior. This usually means that your automated software should behave like a human. Type something in and wait a few seconds, click a button and wait a few seconds, scroll down and wait a few seconds, etc. A human cannot fill in a form in 50 milliseconds ...

Consider another scenario, where Selenium is used to open pages of a Web site using voice control. Even if the website detects that we are using Selenium, this seems perfectly legitimate because it is very clear that the actions are performed by a human.

Walking a website and buying a cheap product is human behavior, but buying all cheap products in large quantities is suspicious. Especially if this happens day after day.

This means that the main thing our software should do is mimic human behavior. Do something and wait a random number of seconds, do something else and wait a random number of seconds, etc.
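A minimal sketch of such a random pause is shown here. The function name `human_pause` and its default values are my own choice for illustration; the full script further down uses a similar helper. The `sleep` parameter is only there so the function can be tried without actually waiting.

```python
import random
import time


def human_pause(min_secs=2.0, jitter_secs=2.0, sleep=time.sleep):
    """Sleep for min_secs plus a random extra of up to jitter_secs.

    Returns the number of seconds waited, which is handy for logging.
    """
    delay = min_secs + random.uniform(0, jitter_secs)
    sleep(delay)
    return delay
```

Call it between actions, for example `human_pause(1.0)` after typing into a field and `human_pause(3.0)` after clicking a button that loads a new page.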

Browser selection

Selenium supports many browsers. For this project, I am using Chrome. The reason is a no-brainer: Chrome is the most used browser, not only by internet users, but also by developers. On my developer machine, I am using Firefox, Chromium, and (sometimes) Brave. This means I can use Chrome exclusively for this project. If you are using Chrome also for web browsing, then you can create a separate profile for the scraping application.

Scraper bot detection

A scraper bot is a tool or piece of code used to extract data from web pages. Web scraping is not new and many organizations have taken measures to prevent automated tools from accessing their websites. The article 'How to Avoid Bot Detection with Selenium', see links below, lists several ways to avoid bot detection.

If we do nothing, then it is very easy to detect our Selenium bot. As long as we behave like a human, this may not be bad. We show the target website that we are either script kiddies or that we want them to notice that we are using some automation.

Still, some companies may block us for their own legit reasons. Let's face it, a good bot suddenly can become a very bad bot.

There are some sites we can use to determine if our bot is recognized as automated software:

  • https://fingerprint.com/products/bot-detection
  • https://pixelscan.net
  • https://nowsecure.nl/#relax

Here is some sample code you can use to try this:

# bot_detect_test.py
import os
import time

# selenium
from selenium import webdriver
# chromedriver_binary, see pypi.org
import chromedriver_binary

# configure webdriver
from selenium.webdriver.chrome.options import Options
webdriver_options = Options()
# use a common screen format
webdriver_options.add_argument('window-size=1920,1080')
# profile
user_data_dir = os.path.join(os.getcwd(), 'selenium_data')
webdriver_options.add_argument('user-data-dir=' + user_data_dir)
# allow popups
webdriver_options.add_experimental_option('excludeSwitches', ['disable-popup-blocking'])


def main():
    with webdriver.Chrome(options=webdriver_options) as wd:
        wd.get('https://fingerprint.com/products/bot-detection')
        time.sleep(8)
        wd.save_screenshot('fingerprint.png')
    print('ready')

if __name__ == '__main__':
    main()

Will we be blocked?

The more we try to prevent detection of our Selenium bot, the more suspicious we may look to bot detectors.

There is a project 'undetected_chromedriver', see links below, that tries to deliver a bot, based on Selenium, that cannot be detected. But if it is detected, then the target website may decide to block you because ... why would someone want to be undetectable?

Companies that develop bot detectors or offer bot detection services closely follow projects like 'undetected_chromedriver' and try to beat any detection prevention methods that are implemented.

Updates also are a problem. A new version of Selenium or Chrome may break the detection prevention measures. But not updating Chrome for a long time may also be detectable and suspicious.

So what should we do? Nothing, a little bit, or use for example 'undetected_chromedriver'? If you want to scrape big, then the best way is to use a scraping service such as the ones mentioned above. For small projects I suggest you do nothing to prevent detection and see what happens.

Do we really need to automate login?

Instead of automating login and logging in via our script, we can also log in manually using the browser, and then use the data from the browser (profile data, session) when connecting to X. This saves a lot of code. But ... our application is not fully automated in this case. And hey, we are using Selenium, the web automation (!) tool.

About the code

The code below automates login to X and then scrapes some posts of a selected user. Once this is done, the script exits. It uses a separate profile for the web browser, meaning it should not interfere with normal browser usage. If login was successful and the script exits, then the next time the script is started it does not log in again but immediately starts selecting the user and scraping the posts.

The posts of the selected user are dumped in a file 'posts.txt'. This means the code does not extract the parts of the posts. You may want to use BeautifulSoup for this.

Recognizing pages

Our script deals with the following pages:

When not logged in:

  • Home page
  • Login - enter email
  • Login - enter password
  • Login - enter username (on unusual activity)

When logged in:

  • Logged in page

The 'enter username' page is shown sometimes, before the 'enter password' page, when X detects unusual activity, for example when you are already logged in via another browser.

What we do in a loop:

  • Extract the language
  • Detect which page is shown
  • Perform the action for the page

When on the home page, we simply redirect to the login page. All other not logged in pages have a title, a 'form field' and a 'form button'. This means we can handle them in a similar way.

And when talking about pages, we are not talking about URLs, but about what is shown on the screen. X is all JavaScript.

The 'which page are we on' detector

This is probably the most 'exciting' piece of code. We first create a list of mutually exclusive 'visibility_of_element_located' conditions:

  • Profile-button => logged in
  • Log in button with text 'Sign in' => 'home_page'
  • <h1> text 'Sign in to X' => 'login_page_1_email'
  • <h1> text 'Enter your password' => 'login_page_2_password'
  • <h1> text 'Enter your phone number or username' => 'login_page_3_username'

Then we wait for at least one of the elements to become visible:

    wait = WebDriverWait(
        ....
    )
    elem = wait.until(EC.any_of(*tuple(or_conditions)))

Once we have the element, we can determine the page. And once we have the page, we can run the appropriate action.
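To make the 'wait for any of these' idea concrete without a browser, here is a plain-Python sketch of the same polling logic. It is not Selenium's implementation: `wait_until_any` is a hypothetical helper, and the conditions are simple callables without a driver argument. Unlike the Selenium call above, which returns only the element, this sketch also returns the index of the condition that matched, so the page can be looked up directly.

```python
import time


def wait_until_any(conditions, timeout=10.0, poll_frequency=1.0,
                   clock=time.monotonic, sleep=time.sleep):
    """Poll the conditions in order until one returns a truthy value.

    Returns (index, value) for the first condition that succeeds and
    raises TimeoutError if none succeeds before the timeout.
    """
    deadline = clock() + timeout
    while True:
        for index, condition in enumerate(conditions):
            value = condition()
            if value:
                return index, value
        if clock() >= deadline:
            raise TimeoutError('none of the conditions became true')
        sleep(poll_frequency)


# Stand-ins for 'is this element visible?' checks (hypothetical values).
page_names = ['home_page', 'login_page_1_email', 'loggedin_page']
checks = [lambda: False, lambda: 'h1: Sign in to X', lambda: False]

index, elem = wait_until_any(checks, timeout=1.0, poll_frequency=0.1)
print(page_names[index])  # login_page_1_email
```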

Oops: 'Something went wrong. Try reloading.'

Upon login, this message appears sometimes. I see it at random times, so I wonder if it has been built in on purpose. The code generates a timeout error when this happens. You need to restart the script or implement your own retry function.
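A retry function could look roughly like this. It is a generic sketch, not part of the script below; the name `retry` and its parameters are my own.

```python
import time


def retry(action, attempts=3, wait_secs=5.0, retry_on=(Exception,),
          sleep=time.sleep):
    """Call action() up to `attempts` times and return its result.

    Only exceptions listed in retry_on trigger a new attempt. If the last
    attempt fails as well, its exception is re-raised.
    """
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except retry_on:
            if attempt == attempts:
                raise
            sleep(wait_secs)
```

In the script you would probably pass `retry_on=(selenium_exceptions.TimeoutException,)` and let `action` reload the page, for example with `wd.refresh()`, before processing it again. I have not tested whether a reload is enough to get past the message.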

Languages

There are two languages we must deal with, the not-logged-in language and the account language. Often they are the same. The code below runs for the languages English and Dutch, even mixed. The language is extracted from the page.

To add a new language, add it to:

    # languages
    self.available_langs = ['en', 'nl']

and add the texts to the pages:

    # pages
    self.pages = [
        ...
    ]

Of course you first must look up the texts by using manual log in.

To start the browser in another language, first remove the browser profile directory:

rm -R browser_profile

and then start the script after (!) setting the new locale:

bash -c 'LANGUAGE=nl_NL.UTF-8 python x_get_posts.py'

XPath searching

If you want to search from the root of the document, start the XPath with '//'.
If you want to search relative to a particular element, start the XPath with './/'.

Posts

At the moment the visible posts are written as HTML into the file 'posts.txt'. To get more posts, implement your own scroll-down function.
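A scroll-down function could be sketched as follows. It only assumes that `wd` has an `execute_script()` method, as a Selenium WebDriver does; the function name, the defaults and the stop rule (stop when the page height no longer grows) are my own choices and have not been tuned for X.

```python
import time


def scroll_down(wd, max_scrolls=5, pause_secs=2.0, sleep=time.sleep):
    """Scroll to the bottom repeatedly until the page stops growing.

    Returns the number of scrolls performed.
    """
    scrolls = 0
    last_height = wd.execute_script('return document.body.scrollHeight')
    while scrolls < max_scrolls:
        wd.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        scrolls += 1
        sleep(pause_secs)  # give the page time to load more posts
        new_height = wd.execute_script('return document.body.scrollHeight')
        if new_height == last_height:
            break  # nothing new was loaded
        last_height = new_height
    return scrolls
```

The page may drop posts from the DOM once they scroll out of view, so it is probably safer to collect the posts after every scroll than only once at the end. To stay close to human behavior, replace the fixed pause with a random wait.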

Extracting data from post

This is up to you. I suggest you use BeautifulSoup.
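If you first want to try without installing anything, the sketch below uses Python's standard-library HTMLParser instead of BeautifulSoup. It assumes that the text of a post sits inside an element with data-testid="tweetText". That is an assumption about X's markup and may not hold, or may change; check it against your own 'posts.txt'. The class and function names are mine.

```python
from html.parser import HTMLParser

# Tags without a closing tag; they must not affect the nesting depth.
VOID_TAGS = {'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
             'link', 'meta', 'source', 'track', 'wbr'}


class PostTextParser(HTMLParser):
    """Collect the text inside every element with data-testid="tweetText"."""

    def __init__(self):
        super().__init__()
        self.texts = []   # one string per post
        self._depth = 0   # > 0 while inside a tweetText element
        self._parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth > 0:
            self._depth += 1
        elif dict(attrs).get('data-testid') == 'tweetText':
            self._depth = 1
            self._parts = []

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._depth == 0:
            return
        self._depth -= 1
        if self._depth == 0:
            self.texts.append(''.join(self._parts).strip())

    def handle_data(self, data):
        if self._depth > 0:
            self._parts.append(data)


def extract_post_texts(html):
    parser = PostTextParser()
    parser.feed(html)
    parser.close()
    return parser.texts


# Made-up sample in the assumed shape of the dumped posts.
sample = (
    '<article data-testid="tweet"><div data-testid="tweetText">'
    '<span>Hello</span> <span>world</span><img alt="x"></div></article>\n'
    '<article data-testid="tweet"><div data-testid="tweetText">'
    '<span>Second post</span></div></article>'
)
print(extract_post_texts(sample))  # ['Hello world', 'Second post']
```

For the real file, read 'posts.txt' and pass its contents to `extract_post_texts()`. For anything beyond the plain text (author, date, links) BeautifulSoup will be more comfortable.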

The code

In case you want to try, here is the code. Before running, make sure you add your account data and search name.

# x_get_posts.py
import logging
import os
import random
import sys
import time

# selenium
from selenium import webdriver
# using chromedriver_binary
import chromedriver_binary  # Adds chromedriver binary to path
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.action_chains import ActionChains
import selenium.common.exceptions as selenium_exceptions

from selenium.webdriver.chrome.options import Options

# configure webdriver: hide gui, window size, full-screen
webdriver_options = Options()
#webdriver_options.add_argument('--headless')
# use a common screen format
webdriver_options.add_argument('window-size=1920,1080')
# use a separate profile
user_data_dir = os.path.join(os.getcwd(), 'browser_profile')
webdriver_options.add_argument('user-data-dir=' + user_data_dir)
# allow popups
webdriver_options.add_experimental_option('excludeSwitches', ['disable-popup-blocking'])
# force language
webdriver_options.add_argument('--lang=nl-NL')

def get_logger(
    console_log_level=logging.DEBUG,
    file_log_level=logging.DEBUG,
    log_file=os.path.splitext(__file__)[0] + '.log',
):
    logger_format = '%(asctime)s %(levelname)s [%(filename)-30s%(funcName)30s():%(lineno)03s] %(message)s'
    logger = logging.getLogger(__name__)
    logger.setLevel(logging.DEBUG)
    if console_log_level:
        # console
        console_handler = logging.StreamHandler(sys.stdout)
        console_handler.setLevel(console_log_level)
        console_handler.setFormatter(logging.Formatter(logger_format))
        logger.addHandler(console_handler)
    if file_log_level:
        # file
        file_handler = logging.FileHandler(log_file)
        file_handler.setLevel(file_log_level)
        file_handler.setFormatter(logging.Formatter(logger_format))
        logger.addHandler(file_handler)
    return logger

logger = get_logger(
    file_log_level=None,
)
logger.debug('START')
logger.debug(f'user_data_dir = {user_data_dir}')


class SeleniumBase:
    def __init__(
        self,
        logger=None,
        wd=None,
    ):
        self.logger = logger
        self.wd = wd

        self.web_driver_wait_timeout = 10
        self.web_driver_wait_poll_frequency = 1
        self.min_wait_for_next_page = 3.
        
    def get_elem_attrs(self, elem, dump=False, dump_header=None):
        elem_attrs = self.wd.execute_script('var items = {}; for (index = 0; index < arguments[0].attributes.length; ++index) { items[arguments[0].attributes[index].name] = arguments[0].attributes[index].value }; return items;', elem)
        if dump:
            if dump_header is not None:
                self.logger.debug(f'{dump_header}')
            for k, v in elem_attrs.items():
                self.logger.debug(f'attrs {k}: {v}')
        return elem_attrs

    def random_wait(self, min_wait_secs=None):
        self.logger.debug(f'(min_wait_secs = {min_wait_secs})')
        min_wait_secs = min_wait_secs or 2.
        wait_secs = min_wait_secs + 2 * random.uniform(0, 1)
        self.logger.debug(f'wait_secs = {wait_secs:.1f}')
        time.sleep(wait_secs)

    def elem_click_wait(self, elem, before_min_wait=None, after_min_wait=None):
        self.logger.debug(f'(, before_min_wait = {before_min_wait}, after_min_wait = {after_min_wait})')
        if before_min_wait is not None:
            self.random_wait(min_wait_secs=before_min_wait)
        elem.click()
        if after_min_wait is not None:
            self.random_wait(min_wait_secs=after_min_wait)

    def elem_send_keys_wait(self, elem, value, before_min_wait=None, after_min_wait=None):
        self.logger.debug(f'(, value = {value}, before_min_wait = {before_min_wait}, after_min_wait = {after_min_wait})')
        if before_min_wait is not None:
            self.random_wait(min_wait_secs=before_min_wait)
        elem.send_keys(value)
        if after_min_wait is not None:
            self.random_wait(min_wait_secs=after_min_wait)

    def set_form_field_value_and_click(self, form_field_elem=None, form_field_value=None, form_button_elem=None):
        self.logger.debug(f'(form_field_elem = {form_field_elem}, form_field_value = {form_field_value}, form_button_elem = {form_button_elem})')
        if form_field_elem is not None and form_field_value is not None:
            self.random_wait(min_wait_secs=1.)
            form_field_elem.send_keys(form_field_value)
        if form_button_elem is not None:
            self.random_wait(min_wait_secs=.5)
            form_button_elem.click()
            self.random_wait(min_wait_secs=self.min_wait_for_next_page)

    def wait_for_element(
        self, 
        xpath,
        parent_xpath=None,
        parent_attr_check=None,
        visible=True,
    ):
        self.logger.debug(f'xpath = {xpath}, parent_xpath = {parent_xpath}, parent_attr_check = {parent_attr_check}, visible = {visible}')

        wait = WebDriverWait(
            self.wd,
            timeout=self.web_driver_wait_timeout,
            poll_frequency=self.web_driver_wait_poll_frequency,
            ignored_exceptions=[
                selenium_exceptions.ElementNotVisibleException,
                selenium_exceptions.ElementNotSelectableException,
            ],
        )
        try:
            if visible:
                elem = wait.until(EC.visibility_of_element_located((By.XPATH, xpath)))
            else:
                elem = wait.until(EC.presence_of_element_located((By.XPATH, xpath)))
        except selenium_exceptions.TimeoutException as e:
            self.logger.exception(f'TimeoutException, e = {e}, e.args = {e.args}')
            raise
        except Exception as e:
            self.logger.exception(f'other exception, e = {e}')
            raise

        self.get_elem_attrs(elem, dump=True, dump_header='elem')

        if parent_xpath is None:
            self.logger.debug(f'elem.tag_name = {elem.tag_name}')
            return elem
            
        parent_elem = elem.find_element(By.XPATH, parent_xpath)
        self.get_elem_attrs(parent_elem, dump=True, dump_header='parent_elem')
        if parent_attr_check is not None:
            for k, v_expected in parent_attr_check.items():
                v = parent_elem.get_attribute(k)
                if v != v_expected:
                    raise Exception(f'parent_elem = {parent_elem}, attr k = {k}, v = {v} != v_expected = {v_expected}')
        self.logger.debug(f'parent_elem.tag_name = {parent_elem.tag_name}')
        return parent_elem
    
    def wait_for_elements(self, xpath):
        self.logger.debug(f'xpath = {xpath}')

        wait = WebDriverWait(
            self.wd,
            timeout=self.web_driver_wait_timeout,
            poll_frequency=self.web_driver_wait_poll_frequency,
            ignored_exceptions=[
                selenium_exceptions.ElementNotVisibleException,
                selenium_exceptions.ElementNotSelectableException,
            ],
        )
        try:
            elems = wait.until(EC.presence_of_all_elements_located((By.XPATH, xpath)))
        except selenium_exceptions.TimeoutException as e:
            self.logger.exception(f'TimeoutException, e = {e}, e.args = {e.args}')
            raise
        except Exception as e:
            self.logger.exception(f'other exception, e = {e}')
            raise
        elems_len = len(elems)
        self.logger.debug(f'elems_len = {elems_len}')
        return elems


class XGetPosts(SeleniumBase):
    def __init__(
        self,
        logger=None,
        wd=None,
        account=None,
        search_for=None,
    ):
        super().__init__(
            logger=logger,
            wd=wd,
        )
        self.account = account
        self.search_for = search_for

        # langs are in the <html> tag
        self.available_langs = ['en', 'nl']

        # pages
        self.pages = [
            {
                'name': 'home_page',
                'texts': {
                    'en': {
                        'login_button': 'Sign in',
                    },
                    'nl': {
                        'login_button': 'Inloggen',
                    },
                },
                'page_processor': self.home_page,
                'login_button_text_xpath': '//a[@role="link"]/div/span/span[text()[normalize-space()="' + '{{login_button_text}}' + '"]]'
            },
            {
                'name': 'login_page_1_email',
                'texts': {
                    'en': {
                        'title': 'Sign in to X',
                        'form_button': 'Next',
                    },
                    'nl': {
                        'title': 'Registreren bij X',
                        'form_button': 'Volgende',
                    },
                },
                'page_processor': self.login_page_1_email,
                'form_field_xpath': '//input[@type="text" and @name="text" and @autocomplete="username"]',
                'form_field_value': self.account['email'],
                'form_button_text_xpath': '//div[@role="button"]/div/span/span[text()[normalize-space()="' + '{{form_button_text}}' + '"]]'

            },
            {
                'name': 'login_page_2_password',
                'texts': {
                    'en': {
                        'title': 'Enter your password',
                        'form_button': 'Log in',
                    },
                    'nl': {
                        'title': 'Voer je wachtwoord in',
                        'form_button': 'Inloggen',
                    },
                },
                'page_processor': self.login_page_2_password,
                'form_field_xpath': '//input[@type="password" and @name="password" and @autocomplete="current-password"]',
                'form_field_value': self.account['password'],
                'form_button_text_xpath': '//div[@role="button"]/div/span/span[text()[normalize-space()="' + '{{form_button_text}}' + '"]]',
            },
            {
                'name': 'login_page_3_username',
                'texts': {
                    'en': {
                        'title': 'Enter your phone number or username',
                        'form_button': 'Next',
                    },
                    'nl': {
                        'title': 'Voer je telefoonnummer of gebruikersnaam',
                        'form_button': 'Volgende',
                    },
                },
                'page_processor': self.login_page_3_username,
                'form_field_xpath': '//input[@type="text" and @name="text" and @autocomplete="on"]',
                'form_field_value': self.account['username'],
                'form_button_text_xpath': '//div[@role="button"]/div/span/span[text()[normalize-space()="' + '{{form_button_text}}' + '"]]',
            },
            {
                'name': 'loggedin_page',
                'texts': {
                    'en': {
                    },
                    'nl': {
                    },
                },
                'page_processor': self.loggedin_page,
            },
        ]

        self.page_name_pages = {}
        for page in self.pages:
            self.page_name_pages[page['name']] = page

    def get_page_lang(self, url):
        self.logger.debug(f'(url = {url})')

        # get lang from html tag
        html_tag_xpath = '//html[@dir="ltr"]'
        html_tag_elem = self.wait_for_element(
            xpath=html_tag_xpath,
        )

        # lang must be present and available
        lang = html_tag_elem.get_attribute('lang')
        if lang not in self.available_langs:
            raise Exception(f'lang = {lang} not in  available_langs = {self.available_langs}')
        self.logger.debug(f'lang = {lang}')
        return lang

    def which_page(self, url, lang):
        self.logger.debug(f'(url = {url}, lang = {lang})')

        # construct list of items that identify:
        # - not logged in pages
        # - logged in
        or_conditions = []
        for page in self.pages:
            page_name = page['name']
            if page_name == 'home_page':
                # check for 'Sign in' button
                login_button_text = page['texts'][lang]['login_button']
                xpath = '//a[@href="/login" and @role="link" and @data-testid="loginButton"]/div[@dir="ltr"]/span/span[text()[normalize-space()="' + login_button_text + '"]]'
                self.logger.debug(f'xpath = {xpath}')
                or_conditions.append( EC.visibility_of_element_located((By.XPATH, xpath)) )

            elif page_name == 'login_page_1_email':
                # check for <h1> title
                title_text = page['texts'][lang]['title']
                xpath = '//h1/span/span[text()[normalize-space()="' + title_text + '"]]'
                self.logger.debug(f'xpath = {xpath}')
                or_conditions.append( EC.visibility_of_element_located((By.XPATH, xpath)) )

            elif page_name == 'login_page_2_password':
                # check for <h1> title
                title_text = page['texts'][lang]['title']
                xpath = '//h1/span/span[text()[normalize-space()="' + title_text + '"]]'
                self.logger.debug(f'xpath = {xpath}')
                or_conditions.append( EC.visibility_of_element_located((By.XPATH, xpath)) )

            elif page_name == 'login_page_3_username':
                # check for <h1> title
                title_text = page['texts'][lang]['title']
                xpath = '//h1/span/span[text()[normalize-space()="' + title_text + '"]]'
                self.logger.debug(f'xpath = {xpath}')
                or_conditions.append( EC.visibility_of_element_located((By.XPATH, xpath)) )

        # check if logged in using profile button
        xpath = '//a[@href="/' + self.account['screen_name'] + '" and @role="link" and @data-testid="AppTabBar_Profile_Link"]'
        self.logger.debug(f'xpath = {xpath}')
        or_conditions.append( EC.visibility_of_element_located((By.XPATH, xpath)) )

        # or_conditions
        self.logger.debug(f'or_conditions = {or_conditions}')

        wait = WebDriverWait(
            self.wd,
            timeout=self.web_driver_wait_timeout,
            poll_frequency=self.web_driver_wait_poll_frequency,
            ignored_exceptions=[
                selenium_exceptions.ElementNotVisibleException,
                selenium_exceptions.ElementNotSelectableException,
            ],
        )
        try:
            elem = wait.until(EC.any_of(*tuple(or_conditions)))
        except selenium_exceptions.TimeoutException as e:
            self.logger.exception(f'selenium_exceptions.TimeoutException, e = {e}, e.args = {e.args}')
            raise
        except Exception as e:
            self.logger.exception(f'other exception, e = {e}')
            raise

        # not logged in and a known page, ... or .... logged in
        self.logger.debug(f'elem.tag_name = {elem.tag_name}')
        self.get_elem_attrs(elem, dump=True)

        page = None
        if elem.tag_name == 'a':
            href = elem.get_attribute('href')
            if href == '/login':
                page = self.page_name_pages['home_page']
            else:
                data_testid = elem.get_attribute('data-testid')
                if data_testid == 'AppTabBar_Profile_Link':
                    page = self.page_name_pages['loggedin_page']
        elif elem.tag_name == 'span':
            elem_text = elem.text
            self.logger.debug(f'elem_text = {elem_text}')
            if elem_text == self.page_name_pages['home_page']['texts'][lang]['login_button']:
                page = self.page_name_pages['home_page']
            elif elem_text == self.page_name_pages['login_page_1_email']['texts'][lang]['title']:
                page = self.page_name_pages['login_page_1_email']
            elif elem_text == self.page_name_pages['login_page_2_password']['texts'][lang]['title']:
                page = self.page_name_pages['login_page_2_password']
            elif elem_text == self.page_name_pages['login_page_3_username']['texts'][lang]['title']:
                page = self.page_name_pages['login_page_3_username']
        elif elem.tag_name == 'h1':
            pass
        self.logger.debug(f'page = {page}')
        if page is None:
            raise Exception(f'page is None')
        return page

    def process_page(self, url):
        self.logger.debug(f'(url = {url})')

        lang = self.get_page_lang(url)
        page = self.which_page(url, lang)

        page_processor = page['page_processor']
        if page_processor is None:
            raise Exception(f'page_processor is None')

        return page_processor(page, url, lang)

    def login_page_123_processor(self, page, url, lang):
        self.logger.debug(f'(page = {page}, url = {url}, lang = {lang})')

        # get (optional) form_field and form_button
        form_field_xpath = page.get('form_field_xpath')
        form_field_value = page.get('form_field_value')
        form_field_elem = None
        if form_field_xpath is not None:
            form_field_elem = self.wait_for_element(
                xpath=form_field_xpath,
            )
        form_button_elem = self.wait_for_element(
            xpath=page['form_button_text_xpath'].replace('{{form_button_text}}', page['texts'][lang]['form_button']),
            parent_xpath='../../..',
            parent_attr_check={
                'role': 'button',
            },
        )
        # enter form_field_value and click
        self.set_form_field_value_and_click(
            form_field_elem=form_field_elem, 
            form_field_value=form_field_value,
            form_button_elem=form_button_elem,
        )
        # return current_url
        current_url = self.wd.current_url
        self.logger.debug(f'current_url = {current_url}')
        return current_url

    def home_page(self, page, url, lang):
        self.logger.debug(f'page = {page}, url = {url}, lang = {lang}')

        login_button_elem = self.wait_for_element(
            xpath=page['login_button_text_xpath'].replace('{{login_button_text}}', page['texts'][lang]['login_button']),
            parent_xpath='../../..',
            parent_attr_check={
                'role': 'link',
            },
        )
        href = login_button_elem.get_attribute('href')
        self.logger.debug(f'href = {href}')

        # redirect
        self.wd.get(href)
        self.random_wait(min_wait_secs=self.min_wait_for_next_page)

        # return current_url
        current_url = self.wd.current_url
        self.logger.debug(f'current_url = {current_url}')
        return current_url

    def login_page_1_email(self, page, url, lang):
        return self.login_page_123_processor(page, url, lang=lang)

    def login_page_2_password(self, page, url, lang):
        return self.login_page_123_processor(page, url, lang=lang)

    def login_page_3_username(self, page, url, lang):
        return self.login_page_123_processor(page, url, lang=lang)

    def loggedin_page(self, page, url, lang):
        self.logger.debug(f'page = {page}, url = {url}, lang = {lang}')

        # locate search box
        xpath = '//form[@role="search"]/div/div/div/div/label/div/div/input[@type="text" and @data-testid="SearchBox_Search_Input"]'
        search_field_elem = self.wait_for_element(xpath)

        # type the search item
        self.elem_send_keys_wait(
            search_field_elem,
            self.search_for,
            before_min_wait=2,
            after_min_wait=self.min_wait_for_next_page,
        )

        # locate search result option buttons
        xpath = '//div[@role="listbox" and starts-with(@id, "typeaheadDropdown-")]/div[@role="option" and @data-testid="typeaheadResult"]/div[@role="button"]'
        button_elems = self.wait_for_elements(xpath)

        # find search_for in options
        xpath = './/span[text()[normalize-space()="' + self.search_for + '"]]'
        found = False
        for button_elem in button_elems:
            try:
                elem = button_elem.find_element(By.XPATH, xpath)
                found = True
                break
            except selenium_exceptions.NoSuchElementException:
                continue
        if not found:
            raise Exception(f'search_for = {self.search_for} not found in typeaheadDropdown')

        # click the found item
        self.elem_click_wait(
            button_elem,
            before_min_wait=2,
            after_min_wait=self.min_wait_for_next_page,
        )

        # slurp the posts that are present in the page
        xpath = '//article[@data-testid="tweet"]'
        posts = self.wait_for_elements(xpath)

        # dump posts
        post_html_items = []
        posts_len = len(posts)
        visible_posts_len = 0
        for post in posts:
            self.logger.debug(f'tag = {post.tag_name}')
            self.logger.debug(f'aria-labelledby = {post.get_attribute("aria-labelledby")}')
            self.logger.debug(f'data-testid = {post.get_attribute("data-testid")}')
            self.logger.debug(f'post.is_displayed() = {post.is_displayed()}')
            if not post.is_displayed():
                continue
            visible_posts_len += 1

            # expand to html
            post_html = post.get_attribute('outerHTML')
            post_html_items.append(post_html)

        if posts_len > 0:
            with open('posts.txt', 'w') as fo:
                fo.write('\n'.join(post_html_items))

        self.logger.debug(f'posts_len = {posts_len}')
        self.logger.debug(f'visible_posts_len = {visible_posts_len}')
        self.random_wait(min_wait_secs=5)
        sys.exit()

def main():

    # your account
    account = {
        'email': 'johndoe@example.com',
        'password': 'secret',
        'username': 'johndoe',
        'screen_name': 'JohnDoe',
    }
    # your search
    search_for = '@elonmusk'

    with webdriver.Chrome(options=webdriver_options) as wd:
        x = XGetPosts(
            logger=logger,
            wd=wd,
            account=account,
            search_for=search_for,
        )
        url = 'https://www.x.com'
        wd.get(url)
        x.random_wait(min_wait_secs=x.min_wait_for_next_page)
        url = wd.current_url

        # loop until loggedin_page() has dumped the posts and calls sys.exit()
        while True:
            logger.debug(f'url = {url}')
            url = x.process_page(url)

if __name__ == '__main__':
    main()

Summary

In this post we used Selenium to automate logging in to X and extract a number of posts. To find elements, we mostly use (relative) XPaths. We do not try to hide the fact that we use automation but try to behave as much like a human as possible. The code supports multiple languages. The code works today but may fail tomorrow due to changes in texts and naming.

Links / credits

ChromeDriver - WebDriver for Chrome
https://sites.google.com/a/chromium.org/chromedriver/capabilities

Playwright
https://playwright.dev/python

Puppeteer
https://pptr.dev

ScrapingAnt - Puppeteer vs. Selenium - Which Is Better? + Bonus
https://scrapingant.com/blog/puppeteer-vs-selenium

Selenium with Python
https://selenium-python.readthedocs.io/index.html

Selenium with Python - 5. Waits
https://selenium-python.readthedocs.io/waits.html

Tracking Modified Selenium ChromeDriver
https://datadome.co/bot-management-protection/tracking-modified-selenium-chromedriver

undetected_chromedriver
https://github.com/ultrafunkamsterdam/undetected-chromedriver

WebScrapingAPI
https://www.webscrapingapi.com

ZenRows - How to Avoid Bot Detection with Selenium
https://www.zenrows.com/blog/selenium-avoid-bot-detection
