About Ghost.py

This documentation corresponds to the version of Ghost.py in the following branch, maintained by Rodrigo Nicolas Carreras: https://github.com/carrerasrodrigo/Ghost.py

Ghost.py was originally created by Jean-Philippe Serafin. I forked his branch and have implemented additional functionality since then.

Ghost.py is a WebKit-based scriptable web browser for Python. It brings you all the power of WebKit with an API in Python.

Installation

First you need to install PyQt or PySide, which in turn requires installing the Qt framework. You can find PyQt at the following link: http://www.riverbankcomputing.com/software/pyqt/download

Then you need to install Ghost.py and the dependencies used for testing, Flask and Paste.

Installing Flask and Paste:

pip install Flask
pip install paste

For Ghost.py:

pip install -e git+git://github.com/carrerasrodrigo/Ghost.py.git#egg=Ghost.py

or alternatively:

git clone git://github.com/carrerasrodrigo/Ghost.py.git
cd Ghost.py
python setup.py install

Easy peasy!

Examples

Let’s search for some planes on eBay:

from ghost import Ghost

url = "http://www.ebay.com/"
gh = Ghost()

# We create a new page
page, page_name = gh.create_page()

# We load the main page of ebay
page_resource = page.open(url, wait_onload_event=True)

# Fill in the search bar and click on the search button
page.set_field_value("#gh-ac", "plane")
page.click("#gh-btn")

# Wait for the next page
page.wait_for_selector("#e1-15")

# Save the image of the screen
page.capture_to("plane.png")

Sometimes we need to scrape a website but don’t need to download its CSS or JS content. This branch of Ghost.py has several options for skipping that work and speeding up page loads. Let’s see another example:

from ghost import Ghost

url = "http://news.ycombinator.com/"
# We enable the cache and set the maximum size to 10 MB
# We don't want to load images or download CSS or JS files
gh = Ghost(cache_size=10, download_images=False,
           prevent_download=["css", "js"])

# We create a new page
page, page_name = gh.create_page()

# wait_onload_event=False tells Ghost to return from the open method
# as soon as the DOM of the web page is ready, without waiting for onload
page_resource = page.open(url, wait_onload_event=False)

# We retrieve the links from the web page
links = page.evaluate("""
                        var links = document.querySelectorAll("a");
                        var listRet = [];
                        for (var i=0; i<links.length; i++){
                            listRet.push(links[i].href);
                        }
                        listRet;
                    """)
# Print the links
for link in links[0]:
    print(link)

Ghost Class

Contents:

class ghost.Ghost(user_agent='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2', wait_timeout=20, wait_callback=None, log_level=30, display=False, viewport_size=(800, 600), cache_dir='/tmp/ghost.py', cache_size=0, download_images=True, prevent_download=[], share_cookies=True, share_cache=True)

Ghost manages multiple QWebPage instances.

Parameters:
  • user_agent – The default User-Agent header.
  • wait_timeout – Maximum step duration in seconds.
  • wait_callback – An optional callable that is periodically executed until Ghost stops waiting.
  • log_level – The optional logging level.
  • display – A boolean that tells Ghost to display the UI.
  • viewport_size – A tuple that sets the initial viewport size.
  • cache_dir – The directory where Ghost is going to put the cache.
  • cache_size – The size of the cache in MB. If it is 0, the cache is automatically disabled.
  • download_images – Indicates whether the browser downloads images.
  • prevent_download – A list of file extensions that you want to prevent from being downloaded.
  • share_cookies – A boolean that indicates whether every page created shares the same cookie jar. If False, every page will have its own cookie jar.
  • share_cache – A boolean that indicates whether every page created shares the same cache directory. If False, the cache directory will be called cache_dir + randomint in order to keep the directories separate.
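The prevent_download option above takes a list of file extensions. As a rough illustration of the idea only, the sketch below decides whether a URL would be skipped by comparing the extension of its path against that list. The helper name is_blocked and the case-insensitive matching on the URL path are assumptions made for this example; this is not Ghost.py's actual implementation.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2
import posixpath


def is_blocked(url, prevent_download):
    """Hypothetical helper: True if the URL path ends with a listed extension."""
    # Only the path is inspected, so query strings such as "?v=3" are ignored.
    path = urlparse(url).path
    extension = posixpath.splitext(path)[1].lstrip(".").lower()
    return extension in [ext.lower() for ext in prevent_download]


blocked = ["css", "js"]
print(is_blocked("http://example.com/static/site.css?v=3", blocked))  # True
print(is_blocked("http://example.com/index.html", blocked))  # False
```

Matching on the path rather than on the whole URL keeps cache-busting query strings from hiding the extension.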
create_page(wait_timeout=20, wait_callback=None, is_popup=False, max_resource_queued=None)

Creates a new GhostWebPage.

Parameters:
  • wait_timeout – The timeout used when we want to load a new URL.
  • wait_callback – An optional callable that is periodically executed until Ghost stops waiting.
  • is_popup – Indicates whether the QWebPage is a popup.
  • max_resource_queued – The maximum number of resources that can be kept in memory. If None, no limit is applied. If 0, no resources are kept. If greater than 0, the number of kept resources will not exceed max_resource_queued.
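The three cases of max_resource_queued (None, 0, and a positive number) behave much like the maxlen argument of collections.deque. The sketch below uses that analogy to show the effect of each value. It is only an illustration: the function make_resource_queue is hypothetical, and dropping the oldest resource first is an assumption, since the documentation does not say which resources are discarded.

```python
from collections import deque


def make_resource_queue(max_resource_queued=None):
    """Hypothetical helper mirroring the three documented cases."""
    # deque(maxlen=None) is unbounded; deque(maxlen=0) keeps nothing.
    return deque(maxlen=max_resource_queued)


unlimited = make_resource_queue(None)
disabled = make_resource_queue(0)
bounded = make_resource_queue(2)

for name in ["a.html", "b.css", "c.js"]:
    unlimited.append(name)
    disabled.append(name)
    bounded.append(name)  # the oldest entry is dropped once the limit is hit

print(list(unlimited))  # ['a.html', 'b.css', 'c.js']
print(list(disabled))   # []
print(list(bounded))    # ['b.css', 'c.js']
```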

exit()

Exits the application and its related objects.

get_page(index)

Returns the indicated GhostWebPage.

Parameters:
  • index – The number of the GhostWebPage.

Returns:

The page if the index exists, None otherwise.

hide()

Close the webview.

remove_page(page)

Destroys the indicated GhostWebPage.

Parameters:
  • page – The GhostWebPage that we want to destroy.

show()

Show current page inside a QWebView.

switch_to_page(index)

Returns the indicated page and changes the focus to it.

Parameters:
  • index – The number of the GhostWebPage.

Returns:

A GhostWebPage if the index exists, None otherwise.

GhostWebPage Class:

class ghost.GhostWebPage(app, network_manager, wait_timeout=20, wait_callback=None, viewport_size=(800, 600), user_agent='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2', log_level=30, download_images=True, create_page_callback=None, is_popup=False, max_resource_queued=None, *args, **kargs)

Overrides QtWebKit.QWebPage in order to intercept some graphical behaviours such as alert() and confirm(). Also intercepts client-side console.log().

Parameters:
  • app – The QApplication that is running Ghost.
  • network_manager – A NetworkManager instance in charge of managing all the network requests.
  • wait_timeout – Maximum step duration in seconds.
  • wait_callback – An optional callable that is periodically executed until Ghost stops waiting.
  • viewport_size – A tuple that sets the initial viewport size.
  • user_agent – The default User-Agent header.
  • log_level – The optional logging level.
  • download_images – Indicates whether the browser downloads images.
  • create_page_callback – A method called when a popup is opened.
  • is_popup – A boolean that indicates whether the page is a popup.
  • max_resource_queued – The maximum number of resources that can be kept in memory. If None, no limit is applied. If 0, no resources are kept. If greater than 0, the number of kept resources will not exceed max_resource_queued.

capture(region=None, selector=None, format=6)

Returns snapshot as QImage.

Parameters:
  • region – An optional tuple containing the region as pixel coordinates.
  • selector – A selector targeting the element to crop to.
  • format – The output image format.

capture_to(path, region=None, selector=None, format=6)

Saves snapshot as image.

Parameters:
  • path – The destination path.
  • region – An optional tuple containing the region as pixel coordinates.
  • selector – A selector targeting the element to crop to.
  • format – The output image format. The available formats are listed here: http://qt-project.org/doc/qt-4.8/qimage.html#Format-enum (there is also a “pdf” format that renders the page into a PDF file).

click(*args, **kwargs)

Click the targeted element.

Parameters:
  • selector – A CSS3 selector for the targeted element.

class confirm(confirm=True, callback=None)

Statement that tells Ghost how to deal with javascript confirm().

Parameters:
  • confirm – A boolean that sets the answer given to the confirmation.
  • callback – A callable that returns a boolean for confirmation.

GhostWebPage.content

Returns main_frame HTML as a string.

GhostWebPage.cookies

Returns all cookies.

GhostWebPage.delete_cookies()

Deletes all cookies.

GhostWebPage.evaluate(*args, **kwargs)

Evaluates script in page frame.

Parameters:
  • script – The script to evaluate.

GhostWebPage.evaluate_js_file(path, encoding='utf-8')

Evaluates the JavaScript file at the given path in the current frame. Raises a native IOError if the file is invalid.

Parameters:
  • path – The path of the file.
  • encoding – The file’s encoding.
GhostWebPage.exists(selector)

Checks if element exists for given selector.

Parameters:
  • selector – The element selector.

GhostWebPage.fill(*args, **kwargs)

Fills a form with provided values.

Parameters:
  • selector – A CSS selector to the target form to fill.
  • values – A dict containing the values.
GhostWebPage.fire_on(*args, **kwargs)

Call method on element matching given selector.

Parameters:
  • selector – A CSS selector to the target element.
  • method – The name of the method to fire.
  • expect_loading – Specifies if a page loading is expected.
GhostWebPage.get_current_frame_content()

Returns current frame HTML as a string.

GhostWebPage.global_exists(global_name)

Checks if javascript global exists.

Parameters:
  • global_name – The name of the global.

GhostWebPage.javaScriptAlert(frame, message)

Notifies Ghost of the alert, then passes.

GhostWebPage.javaScriptConfirm(frame, message)

Checks if ghost is waiting for confirm, then returns the right value.

GhostWebPage.javaScriptConsoleMessage(message, line, source)

Prints client console message in current output stream.

GhostWebPage.javaScriptPrompt(frame, message, defaultValue, result=None)

Checks if ghost is waiting for prompt, then enters the right value.

GhostWebPage.open(address, method='get', headers={}, auth=None, wait_onload_event=True, wait_for_loading=True)

Opens a web page.

Parameters:
  • address – The resource URL.
  • method – The HTTP method.
  • headers – An optional dict of extra request headers.
  • auth – An optional tuple of HTTP auth (username, password).
  • wait_onload_event – If True, waits until the onload event of the main page is fired. Otherwise, waits until the DOM is ready.
  • wait_for_loading – If True, waits until the page is loaded. Note that wait_onload_event has no effect if wait_for_loading is False.

Returns:

The page resource and all loaded resources.

class GhostWebPage.prompt(value='', callback=None)

Statement that tells Ghost how to deal with javascript prompt().

Parameters:
  • value – A string value to fill in prompt.
  • callback – A callable that returns the value to fill in.
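Since prompt is described as a statement, it is presumably meant to be used in a with block around the action that triggers the JavaScript prompt(). The sketch below imitates that pattern without Qt so it can run on its own. FakePage, answer_prompt and _PromptStatement are hypothetical names invented for this example, and the rule that callback takes precedence over value is an assumption rather than documented behaviour.

```python
class _PromptStatement(object):
    """Context manager that registers the expected answer while it is active."""

    def __init__(self, page, value, callback):
        self.page = page
        self.expected = (value, callback)

    def __enter__(self):
        self.page.prompt_expected = self.expected
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # The expectation only lives for the duration of the with block.
        self.page.prompt_expected = None
        return False


class FakePage(object):
    """Hypothetical stand-in for a page; not the real GhostWebPage."""

    def __init__(self):
        self.prompt_expected = None

    def prompt(self, value="", callback=None):
        return _PromptStatement(self, value, callback)

    def answer_prompt(self, message):
        """Roughly what a javaScriptPrompt handler could do."""
        if self.prompt_expected is None:
            raise RuntimeError("unexpected prompt: %s" % message)
        value, callback = self.prompt_expected
        # Assumption: a callback, when given, wins over the fixed value.
        return callback() if callback is not None else value


page = FakePage()
with page.prompt(value="42"):
    fixed = page.answer_prompt("Your age?")
with page.prompt(callback=lambda: "from callback"):
    computed = page.answer_prompt("Your name?")
print("%s / %s" % (fixed, computed))  # 42 / from callback
```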
GhostWebPage.region_for_selector(*args, **kwargs)

Returns the frame region for the given selector as a tuple.

Parameters:
  • selector – The targeted element.

GhostWebPage.set_field_value(*args, **kwargs)

Sets the value of the field matched by given selector.

Parameters:
  • selector – A CSS selector that targets the field.
  • value – The value to fill in.
  • blur – An optional boolean that forces a blur once the field is filled in.

GhostWebPage.set_viewport_size(width, height)

Sets the page viewport size.

Parameters:
  • width – An integer that sets width pixel count.
  • height – An integer that sets height pixel count.
GhostWebPage.switch_to_frame(frameName=None)

Change the focus to the indicated frame

Parameters:
  • frameName – The name of the frame.

GhostWebPage.switch_to_frame_nro(nro=-1)

Change the focus to the indicated frame

Parameters:
  • nro – The number of the frame.

GhostWebPage.switch_to_sub_window(index)

Changes the focus to the sub window (popup).

Parameters:
  • index – The index of the window, in the order in which the windows were opened.

GhostWebPage.wait_for(condition, timeout_message)

Waits until condition is True.

Parameters:
  • condition – A callable that returns the condition.
  • timeout_message – The exception message on timeout.
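A wait of this kind is usually a polling loop: call the condition repeatedly and give up with the timeout message once the time limit passes. The sketch below shows that general technique in plain Python. It is not Ghost.py's implementation, which also has to keep the Qt event loop running and call wait_callback; the WaitTimeout exception and the poll_interval parameter are hypothetical additions for this example.

```python
import time


class WaitTimeout(Exception):
    """Hypothetical exception type used only in this sketch."""


def wait_for(condition, timeout_message, wait_timeout=20, poll_interval=0.01):
    """Poll condition() until it is truthy or wait_timeout seconds elapse."""
    deadline = time.time() + wait_timeout
    while not condition():
        if time.time() > deadline:
            raise WaitTimeout(timeout_message)
        time.sleep(poll_interval)
    return True


attempts = []


def ready_on_third_call():
    # Simulates a page that becomes ready after a few checks.
    attempts.append(1)
    return len(attempts) >= 3


wait_for(ready_on_third_call, "never became ready", wait_timeout=1)
print(len(attempts))  # 3
```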
GhostWebPage.wait_for_alert()

Waits for main frame alert().

GhostWebPage.wait_for_page_loaded()

Waits until the page is loaded, assuming that a page has been requested.

GhostWebPage.wait_for_selector(selector)

Waits until the selector matches an element on the frame.

Parameters:
  • selector – The selector to wait for.

GhostWebPage.wait_for_text(text)

Waits until the given text appears on the main frame.

Parameters:
  • text – The text to wait for.

NetworkAccessManager Class:

class ghost.NetworkAccessManager(*args, **kwargs)

NetworkAccessManager manages a QNetworkAccessManager. It creates an internal cache and manages all the requests.

Parameters:
  • cache_dir – The directory where Ghost is going to put the cache.
  • cache_size – The size of the cache in MB. If it is 0, the cache is automatically disabled.
  • prevent_download – A list of file extensions that you want to prevent from being downloaded.

configureProxy(host, port, user=None, password=None)

Add a proxy configuration for the Network Requests.

Parameters:
  • host – The proxy host.
  • port – The proxy port.
  • user – The user name to use if the proxy requires authentication. It should be None if no user is required.
  • password – The password to use if the proxy requires authentication. It should be None if no password is required.

removeProxy()

Removes the proxy configuration

setAuthCredentials(user, password)

Sets or updates the auth credentials.

Parameters:
  • user – the username used for the authentication
  • password – the password used for the authentication

PaperSize Class:

class ghost.PaperSize(width, height, margin, orientation=None, page_type=None)

This class tells the PdfPrinter how to render the web page.

Parameters:
  • width – An int representing the width of the page
  • height – An int representing the height of the page
  • margin – A tuple of ints representing the margins of the page (margin_left, margin_top, margin_right, margin_bottom).
  • orientation – landscape | portrait. This option only makes sense if page_type is not None.
  • page_type – The format of the page. It can be one of: A0|A1|A2|A3|A4|A5|A6|A7|A8|A9|B0|B1|B2|B3|B4|B5|B6|B7|B8|B9|B10|C5E|Comm10E|DLE|Executive|Folio|Ledger|Legal|Letter|Tabloid

BlackPearl Class

Contents:

class ghost.BlackPearl(ghost, pirateClass, port=8000, request_life=300)

process_events()

Main process that manages all the queued events.

start()

Start the BlackPearl Server

class ghost.Pirate(ghost)

add_event(method, callback=None, *args, **kwargs)

Adds a new event to the event queue.

Parameters:
  • method – The method that is executed when the event is triggered.
  • callback – A method that is executed after “method”. It has to return a tuple of (True|False, Object).
  • args – A list of parameters to be passed to “method”.

event_ready(ev)

This method is called when the execution of an event has ended. It handles the result of the event.

get_event()

Returns the next event in the queue

get_result()

Returns the result of the Ghost scraping.

Returns:

A string with the information obtained.

has_events()

Indicates whether the class has any queued events.

start(data=None)

Adds all the events to the queue.

Parameters:
  • data – Initial information for Ghost.
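The Pirate methods above describe a simple event queue: events are added with a method and an optional callback, then taken out and executed in order. The sketch below imitates that idea without Ghost or Qt. MiniPirate and its run method are hypothetical, and two points are assumptions not confirmed by this documentation: that the callback receives the return value of the method, and that a False in the returned tuple stops the processing.

```python
from collections import deque


class MiniPirate(object):
    """Hypothetical, Qt-free imitation of the event-queue idea."""

    def __init__(self):
        self.events = deque()
        self.results = []

    def add_event(self, method, callback=None, *args, **kwargs):
        self.events.append((method, callback, args, kwargs))

    def has_events(self):
        return len(self.events) > 0

    def get_event(self):
        # Events leave the queue in the order they were added.
        return self.events.popleft()

    def run(self):
        while self.has_events():
            method, callback, args, kwargs = self.get_event()
            value = method(*args, **kwargs)
            if callback is not None:
                # Assumption: callback gets the method result and returns (bool, object).
                ok, value = callback(value)
                if not ok:
                    break
            self.results.append(value)
        return self.results


pirate = MiniPirate()
pirate.add_event(lambda a, b: a + b, None, 2, 3)
pirate.add_event(lambda text: text.upper(), lambda v: (True, v + "!"), "ahoy")
results = pirate.run()
print(results)  # [5, 'AHOY!']
```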
