About Ghost.py
This documentation corresponds to the version of Ghost.py in the following branch:
https://github.com/carrerasrodrigo/Ghost.py (maintained by Rodrigo Nicolas Carreras).
Ghost.py was originally created by Jean-Philippe Serafin. I forked his branch and
have implemented additional functionality since then.
Ghost.py is a WebKit-based scriptable web browser for Python. It brings you all the
power of WebKit with an API in Python.
Installation
First you need to install PyQt or PySide, which in turn requires the Qt framework.
You can find PyQt at the following link: http://www.riverbankcomputing.com/software/pyqt/download
Then install Ghost.py and its testing dependencies, Flask and Paste.
Installing Flask and Paste:
pip install Flask
pip install paste
For Ghost.py:
pip install -e git+git://github.com/carrerasrodrigo/Ghost.py.git#egg=Ghost.py
or alternatively:
git clone git://github.com/carrerasrodrigo/Ghost.py.git
cd Ghost.py
python setup.py install
Easy peasy!
Examples
Let’s search for planes on eBay:
from ghost import Ghost
url = "http://www.ebay.com/"
gh = Ghost()
# We create a new page
page, page_name = gh.create_page()
# We load the main page of ebay
page_resource = page.open(url, wait_onload_event=True)
# Fill in the search box and click on the search button
page.set_field_value("#gh-ac", "plane")
page.click("#gh-btn")
# Wait for the next page
page.wait_for_selector("#e1-15")
# Save the image of the screen
page.capture_to("plane.png")
Sometimes we need to scrape a website without downloading its
CSS or JS content. This branch of Ghost.py has several options that make
scraping faster in that case. Let’s see another example:
from ghost import Ghost
url = "http://news.ycombinator.com/"
# We enable the cache and set the maximum size to 10 MB
# We don't want to load images, CSS or JS files
gh = Ghost(cache_size=10, download_images=False,
           prevent_download=["css", "js"])
# We create a new page
page, page_name = gh.create_page()
# wait_onload_event=False tells Ghost to return from open() as soon as
# the DOM is ready, without waiting for the OnLoad event of the page
page_resource = page.open(url, wait_onload_event=False)
# We retrieve the links from the web page
links = page.evaluate("""
var links = document.querySelectorAll("a");
var listRet = [];
for (var i=0; i<links.length; i++){
listRet.push(links[i].href);
}
listRet;
""")
# Print the links
for link in links[0]:
    print(link)
Ghost Class
Contents:
-
class ghost.Ghost(user_agent='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2', wait_timeout=20, wait_callback=None, log_level=30, display=False, viewport_size=(800, 600), cache_dir='/tmp/ghost.py', cache_size=0, download_images=True, prevent_download=[], share_cookies=True, share_cache=True)
Ghost manages multiple QWebPage instances.
Parameters: |
- user_agent – The default User-Agent header.
- wait_timeout – Maximum step duration in seconds.
- wait_callback – An optional callable that is periodically
executed until Ghost stops waiting.
- log_level – The optional logging level.
- display – A boolean that tells Ghost to display the UI.
- viewport_size – A tuple that sets the initial viewport size.
- cache_dir – The directory where Ghost stores the cache.
- cache_size – The size of the cache in MB. If it is 0, the
cache is automatically disabled.
- download_images – Indicates whether the browser downloads images.
- prevent_download – A list of file extensions that you want
to prevent from downloading.
- share_cookies – A boolean that indicates whether every page created
shares the same cookie jar. If False, every page has its own
cookie jar.
- share_cache – A boolean that indicates whether every page created
shares the same cache directory. If False, the cache directory is named
cache_dir + randomint so that each page gets a separate directory.
|
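To make the prevent_download option more concrete, here is a minimal, standalone sketch of what filtering requests by file extension could look like. It does not use Ghost.py at all: the function name is_blocked is hypothetical, and the matching rule (compare the extension of the URL path, ignoring case and the query string) is an assumption, since the text above does not say how Ghost.py matches extensions. It is written for Python 3.

```python
import posixpath
from urllib.parse import urlsplit


def is_blocked(url, prevent_download):
    """Return True if the URL path ends with one of the listed extensions.

    Assumption: matching is done on the extension of the URL path,
    ignoring case and any query string. Ghost.py may match differently.
    """
    path = urlsplit(url).path
    extension = posixpath.splitext(path)[1].lstrip(".").lower()
    blocked = {ext.lstrip(".").lower() for ext in prevent_download}
    return extension in blocked


# Same extension list as in the Hacker News example above.
print(is_blocked("http://example.com/static/site.css?v=3", ["css", "js"]))
print(is_blocked("http://example.com/index.html", ["css", "js"]))
```

Under these assumptions the stylesheet URL is reported as blocked and the HTML page is not.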
-
create_page(wait_timeout=20, wait_callback=None, is_popup=False, max_resource_queued=None)
Creates a new GhostWebPage.
:param wait_timeout: The timeout used when loading a new URL.
:param wait_callback: An optional callable that is periodically
executed until Ghost stops waiting.
:param is_popup: Indicates whether the QWebPage is a popup.
:param max_resource_queued: The maximum number of
resources that can be kept in memory. If None, no limit
is applied. If 0, no resources are kept. If the number
is > 0, the number of resources kept will not exceed
max_resource_queued.
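The three cases of max_resource_queued (None, 0, and a positive number) can be illustrated with a plain Python container. This is only a sketch of the documented behaviour, not Ghost.py's implementation: make_resource_queue is a hypothetical helper, and dropping the oldest entry once the limit is reached is an assumption, because the description above does not say which resource is discarded.

```python
from collections import deque


def make_resource_queue(max_resource_queued=None):
    """Build a container that follows the three documented cases.

    Assumption: once the limit is reached, the oldest entry is dropped.
    The documentation does not say which entry Ghost.py discards.
    """
    # deque(maxlen=None) is unbounded, maxlen=0 keeps nothing,
    # and maxlen=n keeps only the n most recently appended items.
    return deque(maxlen=max_resource_queued)


for limit in (None, 0, 2):
    queue = make_resource_queue(limit)
    for name in ("a.html", "b.css", "c.js"):
        queue.append(name)
    print(limit, list(queue))
```

With no limit all three names are kept, with 0 the queue stays empty, and with 2 only the last two names remain.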
-
exit()
Exits the application and its related components.
-
get_page(index)
Return the indicated GhostWebPage.
:param index: Number of the GhostWebPage
:return: Returns the page if the index exists, None otherwise
-
hide()
Close the webview.
-
remove_page(page)
Destroys the indicated GhostWebPage.
:param page: The GhostWebPage that we want to destroy
-
show()
Show current page inside a QWebView.
-
switch_to_page(index)
Return the indicated page and change the focus.
:param index: Number of the GhostWebPage
:return: Returns a GhostWebPage if the index exists, None otherwise
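Because get_page and switch_to_page return None for an unknown index, a caller may want to fail early instead of hitting an AttributeError later. The helper below is a small sketch of that pattern; require_page is a hypothetical name, and it relies only on the get_page(index) behaviour described above.

```python
def require_page(gh, index):
    """Return the page at the given index, raising instead of returning None.

    `gh` is expected to be a Ghost instance (or any object that offers
    get_page(index) with the documented "None if missing" behaviour).
    """
    page = gh.get_page(index)
    if page is None:
        raise LookupError("no GhostWebPage at index %r" % (index,))
    return page
```

With a Ghost instance gh that has at least one page, require_page(gh, 0) would return that page, while an index that does not exist raises LookupError.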
GhostWebPage Class:
-
class ghost.GhostWebPage(app, network_manager, wait_timeout=20, wait_callback=None, viewport_size=(800, 600), user_agent='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/535.2 (KHTML, like Gecko) Chrome/15.0.874.121 Safari/535.2', log_level=30, download_images=True, create_page_callback=None, is_popup=False, max_resource_queued=None, *args, **kargs)
Overrides QtWebKit.QWebPage in order to intercept some graphical
behaviours like alert(), confirm().
Also intercepts client side console.log().
Parameters: |
- app – The QApplication that is running Ghost.
- network_manager – A NetworkManager instance in charge of managing all the network
requests.
- wait_timeout – Maximum step duration in seconds.
- wait_callback – An optional callable that is periodically
executed until Ghost stops waiting.
- viewport_size – A tuple that sets the initial viewport size.
- user_agent – The default User-Agent header.
- log_level – The optional logging level.
- download_images – Indicates whether the browser downloads images.
- create_page_callback – A method called when a popup is opened.
- is_popup – A boolean that indicates whether the page is a popup.
- max_resource_queued – The maximum number of resources that can be
kept in memory. If None, no limit is applied. If 0, no resources are kept.
If the number is > 0, the number of resources kept will not exceed max_resource_queued.
|
-
capture(region=None, selector=None, format=6)
Returns snapshot as QImage.
Parameters: |
- region – An optional tuple containing the region as pixel
coordinates.
- selector – A selector targeting the element to crop to.
- format – The output image format.
|
-
capture_to(path, region=None, selector=None, format=6)
Saves snapshot as image.
Parameters: |
- path – The destination path.
- region – An optional tuple containing the region as pixel
coordinates.
- selector – A selector targeting the element to crop to.
- format – The output image format.
The available formats can be found here: http://qt-project.org/doc/qt-4.8/qimage.html#Format-enum
There is also a “pdf” format that renders the page into a PDF file.
|
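A common use of the selector parameter is to save a picture of a single element rather than the whole page. The helper below is a minimal sketch that combines wait_for_selector and capture_to as documented in this reference; capture_element is a hypothetical name, and page is assumed to be a GhostWebPage obtained from gh.create_page().

```python
def capture_element(page, selector, path):
    """Wait for an element to appear, then save a snapshot cropped to it.

    `page` is assumed to be a GhostWebPage; only the documented
    wait_for_selector() and capture_to() methods are used.
    """
    # Make sure the element exists before asking for its region.
    page.wait_for_selector(selector)
    # Crop the snapshot to the element matched by the selector.
    page.capture_to(path, selector=selector)
    return path
```

In the eBay example, a call such as capture_element(page, "#gh-btn", "button.png") would save only the search button, assuming that selector still matches an element on the page.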
-
click(*args, **kwargs)
Click the targeted element.
Parameters: | selector – A CSS3 selector for the targeted element. |
-
class confirm(confirm=True, callback=None)
Statement that tells Ghost how to deal with javascript confirm().
Parameters: |
- confirm – A boolean that confirms or cancels the dialog.
- callback – A callable that returns a boolean for confirmation.
|
-
GhostWebPage.content
Returns main_frame HTML as a string.
-
GhostWebPage.cookies
Returns all cookies.
-
GhostWebPage.delete_cookies()
Deletes all cookies.
-
GhostWebPage.evaluate(*args, **kwargs)
Evaluates script in page frame.
Parameters: | script – The script to evaluate. |
-
GhostWebPage.evaluate_js_file(path, encoding='utf-8')
Evaluates javascript file at given path in current frame.
Raises native IOException in case of invalid file.
Parameters: |
- path – The path of the file.
- encoding – The file’s encoding.
|
-
GhostWebPage.exists(selector)
Checks if element exists for given selector.
Parameters: | selector – The element selector. |
-
GhostWebPage.fill(*args, **kwargs)
Fills a form with provided values.
Parameters: |
- selector – A CSS selector to the target form to fill.
- values – A dict containing the values.
|
-
GhostWebPage.fire_on(*args, **kwargs)
Call method on element matching given selector.
Parameters: |
- selector – A CSS selector to the target element.
- method – The name of the method to fire.
- expect_loading – Specifies if a page loading is expected.
|
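fill and fire_on are often used together: fill the fields of a form, then call its submit method and wait for the resulting page load. The sketch below shows that sequence using only the parameters listed above. submit_form is a hypothetical helper, and treating the keys of values as the form's field names is an assumption, since the description only says that values is a dict.

```python
def submit_form(page, form_selector, values):
    """Fill a form and submit it, expecting a page load afterwards.

    `page` is assumed to be a GhostWebPage. The keys of `values` are
    assumed to identify the form fields (for example by field name).
    """
    # Put the provided values into the form matched by the selector.
    page.fill(form_selector, values)
    # Call the form's submit() method and tell Ghost a page load follows.
    page.fire_on(form_selector, "submit", expect_loading=True)
```

A call could look like submit_form(page, "form#login", {"username": "jeanphix", "password": "secret"}), where the selector and the field names are made up for illustration.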
-
GhostWebPage.get_current_frame_content()
Returns current frame HTML as a string.
-
GhostWebPage.global_exists(global_name)
Checks if javascript global exists.
Parameters: | global_name – The name of the global. |
-
GhostWebPage.javaScriptAlert(frame, message)
Notifies Ghost of the alert, then passes.
-
GhostWebPage.javaScriptConfirm(frame, message)
Checks if Ghost is waiting for a confirm, then returns the right
value.
-
GhostWebPage.javaScriptConsoleMessage(message, line, source)
Prints client console message in current output stream.
-
GhostWebPage.javaScriptPrompt(frame, message, defaultValue, result=None)
Checks if Ghost is waiting for a prompt, then enters the right
value.
-
GhostWebPage.open(address, method='get', headers={}, auth=None, wait_onload_event=True, wait_for_loading=True)
Opens a web page.
Parameters: |
- address – The resource URL.
- method – The Http method.
- headers – An optional dict of extra request headers.
- auth – An optional tuple of HTTP auth (username, password).
- wait_onload_event – If set to True, waits until the OnLoad event of
the main page is fired. Otherwise waits until the DOM is ready.
- wait_for_loading – If True, waits until the page is loaded. Note that wait_onload_event
has no effect if wait_for_loading is False.
|
Returns: | The page resource and all loaded resources.
|
-
class GhostWebPage.prompt(value='', callback=None)
Statement that tells Ghost how to deal with javascript prompt().
Parameters: |
- value – A string value to fill in prompt.
- callback – A callable that returns the value to fill in.
|
-
GhostWebPage.region_for_selector(*args, **kwargs)
Returns the frame region for the given selector as a tuple.
Parameters: | selector – The targeted element. |
-
GhostWebPage.set_field_value(*args, **kwargs)
Sets the value of the field matched by given selector.
Parameters: |
- selector – A CSS selector that targets the field.
- value – The value to fill in.
- blur – An optional boolean that forces a blur event once the field is filled in.
|
-
GhostWebPage.set_viewport_size(width, height)
Sets the page viewport size.
Parameters: |
- width – An integer that sets width pixel count.
- height – An integer that sets height pixel count.
|
-
GhostWebPage.switch_to_frame(frameName=None)
Change the focus to the indicated frame
Parameters: | frameName – The name of the frame |
-
GhostWebPage.switch_to_frame_nro(nro=-1)
Change the focus to the indicated frame
Parameters: | nro – Number of the frame |
-
GhostWebPage.switch_to_sub_window(index)
Change the focus to the sub window (popup)
:param index: The index of the window, in the order that the
window was opened
-
GhostWebPage.wait_for(condition, timeout_message)
Waits until condition is True.
Parameters: |
- condition – A callable that returns the condition.
- timeout_message – The exception message on timeout.
|
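The wait_for method, together with the wait_timeout and wait_callback options described earlier, follows a familiar polling pattern: check a condition repeatedly, run an optional callback between checks, and give up with an error once the timeout has passed. The standalone sketch below illustrates that pattern in plain Python. It is not Ghost.py's code: the WaitTimeout exception, the poll_interval parameter, and the use of time.sleep are assumptions made for the illustration.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition is not met in time (hypothetical name)."""


def wait_for(condition, timeout_message, wait_timeout=20,
             wait_callback=None, poll_interval=0.01):
    """Poll `condition` until it returns True or `wait_timeout` seconds pass."""
    deadline = time.monotonic() + wait_timeout
    while not condition():
        if time.monotonic() > deadline:
            raise WaitTimeout(timeout_message)
        # Give the caller a chance to do periodic work while waiting.
        if wait_callback is not None:
            wait_callback()
        time.sleep(poll_interval)


# A condition that becomes true on its third check.
checks = []


def ready():
    checks.append(1)
    return len(checks) >= 3


wait_for(ready, "never became ready", wait_timeout=1)
print(len(checks))
```

In Ghost.py the condition would typically inspect the page, for example a callable that returns page.exists("#results"), where the selector is made up for illustration.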
-
GhostWebPage.wait_for_alert()
Waits for main frame alert().
-
GhostWebPage.wait_for_page_loaded()
Waits until the page is loaded, assuming that a page has been requested.
-
GhostWebPage.wait_for_selector(selector)
Waits until the selector matches an element on the frame.
Parameters: | selector – The selector to wait for. |
-
GhostWebPage.wait_for_text(text)
Waits until the given text appears on the main frame.
Parameters: | text – The text to wait for. |
NetworkAccessManager Class:
-
class ghost.NetworkAccessManager(*args, **kwargs)
NetworkAccessManager manages a QNetworkAccessManager. It
creates an internal cache and manages all the requests.
Parameters: |
- cache_dir – The directory where Ghost stores the cache.
- cache_size – The size of the cache in MB. If it is 0, the
cache is automatically disabled.
- prevent_download – A list of file extensions that you want
to prevent from downloading.
|
-
configureProxy(host, port, user=None, password=None)
Add a proxy configuration for the Network Requests.
Parameters: |
- host – the proxy host
- port – the proxy port
- user – The username to use if the proxy requires authentication.
It should be None if no user is required to
access the proxy.
- password – The password to use if the proxy requires authentication.
It should be None if no password is required to
access the proxy.
|
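Proxy settings are often available as a single URL rather than as separate host, port, user and password values. The helper below is a sketch that splits such a URL and passes the parts to configureProxy with the signature documented above. configure_proxy_from_url is a hypothetical name, how you obtain the NetworkAccessManager instance is not covered here, and the code is written for Python 3.

```python
from urllib.parse import urlsplit


def configure_proxy_from_url(manager, proxy_url):
    """Split a proxy URL and hand its parts to manager.configureProxy().

    `manager` is assumed to be a NetworkAccessManager. The user and the
    password are passed as None when the URL carries no credentials,
    which matches the parameter description above.
    """
    parts = urlsplit(proxy_url)
    if parts.hostname is None or parts.port is None:
        raise ValueError("proxy URL needs a host and a port: %r" % (proxy_url,))
    manager.configureProxy(parts.hostname, parts.port,
                           user=parts.username, password=parts.password)
```

For example, configure_proxy_from_url(manager, "http://alice:secret@proxy.example.com:3128") would call configureProxy with host "proxy.example.com", port 3128, user "alice" and password "secret"; the URL is made up for illustration.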
-
removeProxy()
Removes the proxy configuration
-
setAuthCredentials(user, password)
Sets or updates the auth credentials.
Parameters: |
- user – the username used for the authentication
- password – the password used for the authentication
|
PaperSize Class:
-
class ghost.PaperSize(width, height, margin, orientation=None, page_type=None)
This class tells the PdfPrinter how to render the web page.
Parameters: |
- width – An int representing the width of the page
- height – An int representing the height of the page
- margin – a tuple of ints representing the margins of the page
(margin_left, margin_top, margin_right, margin_bottom)
- orientation – landscape | portrait. This option only makes
sense if page_type is not None.
- page_type – The format of the page. It can be one of:
A0|A1|A2|A3|A4|A5|A6|A7|A8|A9|B0|B1|B2|B3|B4|B5|B6|B7|
B8|B9|B10|C5E|Comm10E|DLE|Executive|Folio|Ledger|
Legal|Letter|Tabloid
|
BlackPearl Class
Contents:
-
class ghost.BlackPearl(ghost, pirateClass, port=8000, request_life=300)
-
process_events()
Main process that manages all the events queued
-
start()
Start the BlackPearl Server
-
class ghost.Pirate(ghost)
-
add_event(method, callback=None, *args, **kwargs)
Add a new event to the event queue.
:param method: The method that is executed when the event is triggered.
:param callback: A method that is executed after “method”.
It has to return a tuple of (True|False, Object).
:param args: A list of parameters to be passed to ‘method’.
-
event_ready(ev)
This method is called when the execution of an event has ended. It handles
the result of the event.
-
get_event()
Returns the next event in the queue
-
get_result()
Returns the result of the Ghost scraping.
:return: A string with the information obtained
-
has_events()
Indicates whether the class has any events queued.
-
start(data=None)
Adds all the events to the queue.
:param data: An initial information for Ghost